Cross-modal subspace learning with Kernel correlation maximization and Discriminative structure preserving

03/26/2019 · by Jun Yu, et al. · Alibaba Cloud; NetEase, Inc.

Measuring the similarity between heterogeneous data is still an open problem. Many research works have been developed to learn a common subspace where the similarity between different modalities can be calculated. However, most existing works focus on learning a low dimensional subspace and ignore the loss of discriminative information in the process of dimensionality reduction. Thus, these approaches cannot achieve the results they expect. On the basis of the Hilbert space theory, in which different Hilbert spaces of the same dimension are isomorphic, we propose a novel framework in which the multiple use of label information can facilitate a more discriminative subspace representation and learn an isomorphic Hilbert space for each modality. Our model not only considers the inter-modality correlation by maximizing the kernel correlation, but also preserves the structure information within each modality according to a constructed graph model. Extensive experiments are performed on three public datasets to evaluate the proposed framework, termed Cross-modal subspace learning with Kernel correlation maximization and Discriminative structure preserving (CKD). The experimental results demonstrate the competitive performance of the proposed CKD compared with classic subspace learning methods.




I Introduction

Recently, the fast development of the Internet and the explosive growth of multimedia, including text, image, video and audio, have greatly enriched people's lives but magnified the challenge of information retrieval. Representative image retrieval methods, such as wavelet-based salient points [1], color-based image retrieval [2], the Contour Points Distribution Histogram (CPDH) [3], Inverse Document Frequency (IDF) [4], and content-based image retrieval [5], cannot be directly applied to applications with multiple modalities. Multimodal data refers to data of different types but with the same content information, for example, a video clip and its music, or photos and tweets recording the same concert. Cross-modal retrieval, which aims to take one type of data as the query to retrieve relevant data objects of another type, has attracted much attention.
   Cross-modal retrieval needs to solve two basic problems: one is how to measure the relevance between heterogeneous modalities, and the other is how to learn a discriminative feature representation for each modality. For the first problem, most methods learn a latent common subspace where all modalities are comparable [6]. These approaches, which are dubbed subspace learning models, include unsupervised approaches [7][8][9], supervised approaches [10][11], deep neural network based approaches [12][13], etc. Common space learning approaches have brought great improvements to cross-modal retrieval tasks. However, most of them force all modalities to be transformed into one designed low dimensional subspace, which results in information loss because the modalities constrain each other. Therefore, in this paper, we learn a subspace for each modality separately based on the Hilbert space theory.
   To solve the second problem mentioned above, many models [14][15][16] employ the label information to enhance the discriminative ability of the learned features. However, most of them incorporate the discriminative label information at only one stage or the final step of the model. Actually, discriminative information may be lost at each stage from the input to the output of the model.

   In this paper, we regard the label space as a modal space and measure the correlations among the kernel matrices of multiple modalities based on the Hilbert-Schmidt independence criterion. At the same time, we use the label vectors to construct a similarity graph over the multi-modal samples. Our model preserves the semantic structure within each modality by virtue of the constructed similarity graph during subspace learning. The model proposed in this paper not only considers the inter-modality correlation relationship, but also preserves the intra-modality similarity relationship. Besides, inspired by [17], an ℓ2,1-norm constraint is imposed on the projection matrices separately to select discriminative features for each modality. The main contributions of our work are summarized as follows:

   (1) The proposed learning model combines subspace learning, feature selection and semantic structure preserving into a joint framework. In addition, the convergence of the algorithm is analyzed theoretically.

   (2) Supervised label information is employed to explore intra-modality similarity among data and the inter-modality correlation simultaneously.
   (3) Comprehensive experiments are carried out on three widely-used datasets and the experimental results show the superiority and effectiveness of our model.
   Structurally, the rest of this paper is organized as follows. The structure of the CKD model is described in Section II, and the joint optimization process and convergence analysis of CKD are presented in Section III. The experimental results and analysis are depicted in Section IV, and finally we draw the conclusions of the paper.

Fig. 1: The illustration of the model proposed in this paper. In the training phase, the two heterogeneous modalities are mapped to their respective Hilbert spaces. For each modality, the samples in the Hilbert space preserve the structure information of the original space, and the correlations among the multiple modalities are maximized. In the testing phase, we obtain the feature representation of an arbitrary query in the learned Hilbert space, and the data of the other modality with the same semantic content are returned from the database.

II Our Method

II-A Problem Formulation

Assume that there are M different modalities, denoted as {X^(1), ..., X^(M)}. The m-th modality X^(m) ∈ R^{d_m×n} (m = 1, ..., M) contains n samples with dimension d_m. The classification label matrix is denoted by Y ∈ {0,1}^{n×c}, where c is the number of categories. Y_{ij} = 1 if the i-th sample belongs to the j-th class, and Y_{ij} = 0 otherwise. Without loss of generality, the samples are zero-centered for each modality, i.e. X^(m)1_n = 0 (m = 1, ..., M).

II-B Hilbert-Schmidt Independence Criterion

Suppose that φ and ψ are two mapping functions that map the samples x and y into reproducing kernel Hilbert spaces. The associated positive definite kernels k_1 and k_2 are formulated as k_1(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ and k_2(y_i, y_j) = ⟨ψ(y_i), ψ(y_j)⟩ respectively. Given n paired data samples Z = {(x_1, y_1), ..., (x_n, y_n)}, the empirical expression of the Hilbert-Schmidt independence criterion (HSIC) [18][19][20] is defined as

HSIC(Z) = (n − 1)^{−2} tr(K_1 H K_2 H),   (1)

where H = I − (1/n)11^T is a centering matrix, (K_1)_{ij} = k_1(x_i, x_j) and (K_2)_{ij} = k_2(y_i, y_j).
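As a concrete illustration, the empirical HSIC in (1) can be computed directly from two kernel matrices. The sketch below uses linear kernels on synthetic data and is only an illustration of the definition, not part of the model:

```python
import numpy as np

def hsic(K1, K2):
    """Empirical HSIC of Eq. (1): (n-1)^{-2} tr(K1 H K2 H)."""
    n = K1.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return np.trace(K1 @ H @ K2 @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
Y = X @ rng.standard_normal((4, 3))          # Y is a function of X
K_x, K_y = X @ X.T, Y @ Y.T                  # linear kernel matrices
print(hsic(K_x, K_y))                        # positive value indicating dependence
```

A constant kernel carries no information, so its HSIC with any other kernel is zero; this is a quick way to check an implementation.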

II-C Model

For simplicity, we discuss our algorithm based on two modalities, i.e., image and text; it is easy to extend to the case with more modalities.

II-C1 Kernel Dependence Maximization

Different Hilbert spaces with the same dimension are isomorphic; thus we can conduct cross-modal retrieval tasks based on this Hilbert space theory. The semantic labels shared by multiple modalities can be regarded as one more modal data space. The correlations among multi-modal data can then be measured by calculating the similarity among the corresponding kernel matrices, and HSIC is introduced to calculate this similarity in our model. The kernel matrix of the m-th modality is K^(m) = (P^(m)ᵀX^(m))ᵀ(P^(m)ᵀX^(m)), where P^(m) ∈ R^{d_m×r} (m = 1, 2) is the projection matrix of the m-th modal data and r is the dimension of the Hilbert space. K_L = YYᵀ is denoted as the kernel matrix of the semantic labels. According to the definition (1), the objective formulation can be given as follows:

max_{P^(1), P^(2)}  Σ_{m=1}^{2} HSIC(K^(m), K_L) + HSIC(K^(1), K^(2)),   (2)

where each HSIC term is computed as in (1). The correlations among different modalities are in essence the inter-modality similarity relationship.

II-C2 Discriminative Structure Preserving

Although the different modalities lie in heterogeneous spaces, they share the same semantic labels. We construct a cosine similarity graph by leveraging the semantic label vectors. Specifically, the similarity between the i-th and the j-th entity, each a paired image and text, is defined as follows

S_{ij} = (y_iᵀy_j) / (||y_i||_2 ||y_j||_2),   (3)

where ||y||_2 denotes the ℓ2-norm of the vector y. The similarity calculated by equation (3) is consistent for different modalities. We hope to preserve this similarity relationship among objects within each single modality in the Hilbert space, which can be viewed as the intra-modality similarity relationship. According to the constructed graph, the objective function can be formulated as follows

min_{P^(1), P^(2)}  λ_1 tr(P^(1)ᵀX^(1)LX^(1)ᵀP^(1)) + λ_2 tr(P^(2)ᵀX^(2)LX^(2)ᵀP^(2)),   (4)

where L = D_S − S is the graph Laplacian matrix (D_S is the diagonal degree matrix with (D_S)_{ii} = Σ_j S_{ij}), and λ_1 and λ_2 are two adjustable parameters. As discussed in the literature [21][22], learning frameworks with an ℓ2,1-norm constraint have the advantages of feature selection, sparsity and robustness to noise. In our model, we impose the ℓ2,1-norm constraint on the projection matrices to learn more discriminative features and remove redundant ones. The objective function (4) can be rewritten as follows

min_{P^(1), P^(2)}  Σ_{m=1}^{2} λ_m tr(P^(m)ᵀX^(m)LX^(m)ᵀP^(m)) + β_1 ||P^(1)||_{2,1} + β_2 ||P^(2)||_{2,1},   (5)

where β_1 and β_2 are two trade-off parameters.
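The label-cosine graph of (3) and the Laplacian smoothness term of (4) can be sketched as follows; the one-hot labels and the small data/projection matrices are hypothetical stand-ins for illustration only:

```python
import numpy as np

# one-hot label matrix Y (n = 4 samples, c = 2 classes)
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

# Eq. (3): cosine similarity between label vectors
Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
S = Yn @ Yn.T

# graph Laplacian L = D_S - S
L = np.diag(S.sum(axis=1)) - S

# Eq. (4) smoothness term for one modality (random stand-ins for X, P)
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))              # d = 5 features, n = 4 samples
P = rng.standard_normal((5, 2))              # projection to r = 2 dimensions
smooth = np.trace(P.T @ X @ L @ X.T @ P)
print(S)                                     # 1 within a class, 0 across classes
```

With one-hot labels the cosine similarity is exactly the class-indicator graph, and the trace term is nonnegative because the Laplacian is positive semidefinite.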
We integrate kernel dependence maximization and discriminative structure preserving into a joint framework by combining (2) and (5). The overall objective function can be written as follows

min_{P^(1), P^(2)}  −α[ Σ_{m=1}^{2} HSIC(K^(m), K_L) + HSIC(K^(1), K^(2)) ] + Σ_{m=1}^{2} [ λ_m tr(P^(m)ᵀX^(m)LX^(m)ᵀP^(m)) + β_m ||P^(m)||_{2,1} ],   (6)

where α is an adjustable parameter. We usually set α = 1 for simplicity. If α = 0, the model just considers the intra-modality relationship and ignores the inter-modality relationship.

II-D Optimization

To optimize all variables conveniently, we transform the ℓ2,1-norm of each projection matrix into a trace form by introducing an intermediate diagonal variable D^(m) with D^(m)_{ii} = 1/(2||p_i^(m)||_2), where p_i^(m) is the i-th row of P^(m). Besides, α is set to 1. The objective function (6) can be rewritten as

min_{P^(1), P^(2)}  −[ Σ_{m=1}^{2} HSIC(K^(m), K_L) + HSIC(K^(1), K^(2)) ] + Σ_{m=1}^{2} [ λ_m tr(P^(m)ᵀX^(m)LX^(m)ᵀP^(m)) + β_m tr(P^(m)ᵀD^(m)P^(m)) ],   (7)

where tr(P^(m)ᵀD^(m)P^(m)) = ||P^(m)||_{2,1}/2 when D^(m) is computed from the current P^(m).
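The substitution used here, expressing ||P||_{2,1} through tr(PᵀDP) with D_ii = 1/(2||p_i||_2), can be checked numerically. This is a small sanity sketch of the identity, not the full solver:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.standard_normal((6, 3))

row_norms = np.linalg.norm(P, axis=1)        # ||p_i||_2 for each row
l21 = row_norms.sum()                        # ||P||_{2,1}

D = np.diag(1.0 / (2.0 * row_norms))         # intermediate variable D
trace_form = np.trace(P.T @ D @ P)           # sum_i ||p_i||^2 / (2||p_i||)

print(l21, 2 * trace_form)                   # tr(P^T D P) = ||P||_{2,1} / 2
```

In the alternating algorithm, D is held at its value from the previous iterate while P is updated, which is what makes the convergence argument of Section II-E go through.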

II-D1 Optimization of P^(1)

Keeping only the terms relating to P^(1), we obtain

max_{P^(1)}  tr(P^(1)ᵀΦ^(1)P^(1))  s.t. P^(1)ᵀP^(1) = I,   (8)

where Φ^(1) = (n − 1)^{−2}X^(1)H(K_L + K^(2))HX^(1)ᵀ − λ_1X^(1)LX^(1)ᵀ − β_1D^(1). The solution of P^(1), which consists of the first r eigenvectors corresponding to the r largest eigenvalues of Φ^(1), can be obtained using eigenvalue decomposition on Φ^(1).

II-D2 Optimization of P^(2)

Keeping only the terms relating to P^(2), we obtain

max_{P^(2)}  tr(P^(2)ᵀΦ^(2)P^(2))  s.t. P^(2)ᵀP^(2) = I,   (9)

where Φ^(2) = (n − 1)^{−2}X^(2)H(K_L + K^(1))HX^(2)ᵀ − λ_2X^(2)LX^(2)ᵀ − β_2D^(2). Being similar to P^(1), the solution of P^(2) can be obtained by performing the eigenvalue decomposition on Φ^(2).
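Each subproblem thus reduces to taking the top-r eigenvectors of a symmetric matrix. With numpy this step might look like the following, where Phi is a stand-in for the matrix assembled from (8) or (9):

```python
import numpy as np

def top_r_eigenvectors(Phi, r):
    """Columns = eigenvectors of the r largest eigenvalues of symmetric Phi."""
    w, V = np.linalg.eigh(Phi)               # eigh returns ascending eigenvalues
    return V[:, np.argsort(w)[::-1][:r]]

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
Phi = (A + A.T) / 2                          # symmetrize, as Phi is in (8)/(9)
P = top_r_eigenvectors(Phi, 3)
print(P.shape)                               # (8, 3), with P^T P = I
```

By the Ky Fan theorem, tr(PᵀΦP) under PᵀP = I is maximized exactly by these eigenvectors, and the maximum equals the sum of the r largest eigenvalues.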

Input: X^(1), X^(2), label matrix Y, the dimension r of the Hilbert space, λ_1, λ_2, β_1, β_2.
Output: P^(1), P^(2).
Initialize P^(1), P^(2). Calculate the similarity matrix S according to (3).
1:  repeat
2:     Compute D^(1) by D^(1)_{ii} = 1/(2||p_i^(1)||_2).
3:     Compute D^(2) by D^(2)_{ii} = 1/(2||p_i^(2)||_2).
4:     Update P^(1) using Eq. (8).
5:     Update P^(2) using Eq. (9).
6:  until convergence
Algorithm 1 The algorithm proposed in this paper (CKD)
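Under the linear-kernel reading above, Algorithm 1 can be sketched end to end as follows. Data shapes, initialization, and the crude convergence test are illustrative assumptions, not the authors' reference implementation:

```python
import numpy as np

def ckd_train(X1, X2, Y, r, lam=(1.0, 1.0), beta=(0.01, 0.01),
              max_iter=50, tol=1e-6, seed=0):
    """Alternating optimization of P1, P2 (Algorithm 1, linear kernels)."""
    n = Y.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                  # centering matrix
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    S = Yn @ Yn.T                                        # Eq. (3)
    L = np.diag(S.sum(axis=1)) - S                       # graph Laplacian
    K_L = Y @ Y.T                                        # label kernel

    rng = np.random.default_rng(seed)
    Xs = (X1, X2)
    P = [rng.standard_normal((X.shape[0], r)) for X in Xs]
    prev = np.inf
    for _ in range(max_iter):
        for m in (0, 1):
            o = 1 - m                                    # the other modality
            row = np.linalg.norm(P[m], axis=1)
            D = np.diag(1.0 / (2.0 * np.maximum(row, 1e-12)))
            K_o = Xs[o].T @ P[o] @ P[o].T @ Xs[o]        # other modality's kernel
            Phi = (Xs[m] @ H @ (K_L + K_o) @ H @ Xs[m].T) / (n - 1) ** 2 \
                  - lam[m] * Xs[m] @ L @ Xs[m].T - beta[m] * D
            Phi = (Phi + Phi.T) / 2
            w, V = np.linalg.eigh(Phi)
            P[m] = V[:, np.argsort(w)[::-1][:r]]         # top-r eigenvectors
        cur = sum(np.linalg.norm(p, axis=1).sum() for p in P)
        if abs(prev - cur) < tol:                        # crude progress proxy
            break
        prev = cur
    return P

# toy run: two modalities, 12 samples, 3 classes
rng = np.random.default_rng(1)
Y = np.eye(3)[rng.integers(0, 3, 12)]
X1, X2 = rng.standard_normal((7, 12)), rng.standard_normal((5, 12))
P1, P2 = ckd_train(X1, X2, Y, r=2)
print(P1.shape, P2.shape)                                # (7, 2) (5, 2)
```

At test time a query is projected as P^(m)ᵀx and compared against the other modality's projected database, e.g. by cosine similarity.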

II-E Convergence Analysis

The detailed optimization procedure is summarized in Algorithm 1. The process is repeated until the algorithm converges. The convergence curves on NUS-WIDE, Pascal-Sentence and MIRFlickr25k are plotted in Fig. 2, from which we can see that our method converges quickly.
The objective function (7), optimized with the updating rule in Algorithm 1, decreases monotonically and converges to the global minimum value.
Given any nonzero vectors u and v of the same dimension, the following inequality (10) holds; please refer to [23] for details:

||u||_2 − ||u||_2² / (2||v||_2) ≤ ||v||_2 − ||v||_2² / (2||v||_2).   (10)

In order to prove the monotonic decrease of (7), we denote by f_m(P^(m)) the terms of (7) that depend on P^(m) except the ℓ2,1 regularization. The optimizations of P^(1) and P^(2) are symmetrical in Algorithm 1, thus we only need to prove the result for one of them; the detailed proof for the optimization of P^(1) is provided below. With P^(2) fixed, step 4 of Algorithm 1 solves

P^(1)_{t+1} = arg min_{P^(1)ᵀP^(1)=I}  f_1(P^(1)) + β_1 tr(P^(1)ᵀD^(1)_t P^(1)),   (11)

where D^(1)_t is computed from P^(1)_t. For the t-th iteration, therefore,

f_1(P^(1)_{t+1}) + β_1 Σ_i ||p_i^{t+1}||_2² / (2||p_i^t||_2) ≤ f_1(P^(1)_t) + β_1 Σ_i ||p_i^t||_2² / (2||p_i^t||_2).

By virtue of (10) with u = p_i^{t+1} and v = p_i^t, summed over all rows i and added to the inequality above, we can obtain

f_1(P^(1)_{t+1}) + β_1 ||P^(1)_{t+1}||_{2,1} ≤ f_1(P^(1)_t) + β_1 ||P^(1)_t||_{2,1}.

Similarly, we can prove the same for the optimization of P^(2). Then the objective value of (7) is non-increasing over the iterations. Therefore, the proposed model based on the updating rules in Algorithm 1 is decreasing monotonically. Since the optimization problem is convex, the objective function finally converges to the global optimal solution.
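Inequality (10) can be sanity-checked numerically on random vectors (an illustration only, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    u, v = rng.standard_normal(5), rng.standard_normal(5)
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    # Eq. (10): ||u|| - ||u||^2/(2||v||) <= ||v|| - ||v||^2/(2||v||)
    assert nu - nu**2 / (2 * nv) <= nv - nv**2 / (2 * nv) + 1e-12
print("inequality (10) holds on all samples")
```

The inequality is tight exactly when ||u||_2 = ||v||_2, which is why equality in the convergence chain implies the iterates have stopped changing.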

III Experiments

In this section, we conduct experiments on three benchmark datasets, i.e. Pascal-Sentence [24], MIRFlickr [25] and NUS-WIDE [26]. Two evaluation protocols are introduced to evaluate our algorithm, and the comparison with several classic cross-modal methods is also presented below.

III-A Datasets


Pascal-Sentence: This dataset consists of 1,000 samples with paired image and text; each image is described by several sentences. The dataset is divided into 20 categories and each category includes 50 samples. We randomly select 30 samples from each category as the training set and use the rest as the testing set. For each image, we employ a convolutional neural network to extract 4096-dimensional CNN features. For the text features, we utilize the LDA model to obtain the probability of each topic, and the 100-dimensional topic-probability vector is used to represent the text.


MIRFlickr: This dataset contains 25,000 images crawled from the Flickr website. Each image and its associated textual tags is called an instance, and each instance is manually assigned to some of 24 classes. We only keep textual tags that appear at least 20 times and remove the instances without annotated labels or any textual tags. Each image is represented by a 150-dimensional edge histogram vector, and each text is represented as a 500-dimensional vector derived from PCA on its bag-of-words vector. We randomly select 5% of the instances as the query set and 30% of the instances as the training set.

   NUS-WIDE: This dataset is a subset sampled from a real-world web image dataset [26] including 190,420 image-text pairs with 21 possible labels. For each pair, the image is represented by 500-dimensional SIFT bag-of-visual-words features and the text by 1000-dimensional tag annotations. We use 8,687 image-text pairs, divided into two parts: 5,212 pairs for training and 3,475 pairs for testing.

III-B Experimental Setting

The CKD proposed in this paper is a supervised, kernel-based and correlation-based method. We compare our algorithm with the following methods: a correlation-based method (CCA) [27]; a kernel-based and correlation-based method (KCCA) [28]; a supervised and correlation-based method (ml-CCA) [29]; and a supervised, kernel-based and correlation-based method (KDM) [18]. Besides, we compare our algorithm with the variant of itself whose α is set to zero, denoted CKD(α=0). We tune the dimension r in the range {10, 20, 30, 40, 50, 60}. β_1 and β_2 are the two coefficients of the regularization terms and are fixed as 0.01 in the experiments. λ_1 and λ_2 are the trade-off parameters in the objective function, whose possible values are set in the range {1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 1e2, 1e3, 1e4, 1e5} empirically. The best results are reported in this paper. Our experiments are implemented on MATLAB 2016b and the Windows 10 (64-bit) platform, on a desktop machine with 12 GB memory and a 4-core 3.6 GHz Intel(R) Core(TM) i7-7700 CPU.

III-C Evaluation Protocol

There are many evaluation metrics in the information retrieval area. We introduce two commonly used indicators, i.e. the Cumulative Match Characteristic curve (CMC) with the rank-{5,10,15,20,25,30} accuracies, and the Mean Average Precision (MAP). For a query q, the Average Precision (AP) is defined as

AP = (1/R) Σ_{k=1}^{K} P(k)δ(k),

where δ(k) = 1 if the result at position k is correct and δ(k) = 0 otherwise; P(k) represents the precision of the top-k retrieval results; and R is the number of correct results among the top K. The average value of AP over all queries is called MAP, where a larger value indicates a better performance.
   CMC is the probability statistic that the true retrieval results appear in different-sized candidate lists. Specifically, if the retrieval results contain one or more objects classified into the same class as the query data, we say that this query matches a true object. Assuming that the length of the retrieval result list is fixed as K, the rate of true matches over all queries is denoted CMC@K.
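The two metrics can be computed as follows from each query's ranked 0/1 relevance list; the function names and toy lists are illustrative:

```python
import numpy as np

def average_precision(rel):
    """AP over a ranked 0/1 relevance list: mean of precision@k at each hit."""
    rel = np.asarray(rel, dtype=float)
    if rel.sum() == 0:
        return 0.0
    prec_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return (prec_at_k * rel).sum() / rel.sum()

def cmc_at_k(rel_lists, k):
    """Fraction of queries with at least one correct result in the top k."""
    return np.mean([1.0 if any(r[:k]) else 0.0 for r in rel_lists])

rels = [[1, 0, 1, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
print(np.mean([average_precision(r) for r in rels]))     # MAP over the queries
print(cmc_at_k(rels, 2))                                 # CMC@2
```

Note that CMC only asks whether any true match appears in the top k, so it is always at least as large as precision-based scores at the same cutoff.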

III-D Results

According to the above settings and protocols, we perform two typical retrieval tasks: image query on the text database and text query on the image database, abbreviated as 'I2T' and 'T2I' respectively. I2T takes the image modality as the query and the text modality as the retrieval objects, while T2I takes the text modality as the query and the image modality as the retrieval objects. Table I shows the MAP results on Pascal-Sentence, MIRFlickr and NUS-WIDE. From the results reported in Table I, we can observe that CKD outperforms the other comparative methods. KDM learns the subspace by maximizing kernel correlation, and CKD(α=0) only preserves the structural consistency in the process of changing the data space for each modality. Obviously, CKD is superior to both KDM and CKD(α=0) in terms of retrieval precision, which manifests that CKD, by integrating the two, improves the retrieval performance.
   Besides, to further validate the effectiveness of CKD, we conduct some experiments on the basis of CMC. Fig. 3 and Fig. 4 show the performance variation of all approaches with respect to different-sized candidate lists on I2T and T2I respectively. They show that our model achieves the best performance among all approaches.
   Overall, the above experimental results on Pascal-Sentence, MIRFlickr and NUS-WIDE indicate that the proposed CKD is effective for cross-modal retrieval.

Datasets          Approaches  Image as query  Text as query
Pascal-Sentence   CCA         0.0501          0.0456
                  KCCA        0.0376          0.0402
                  ml-CCA      0.0422          0.0329
                  CKD(α=0)    0.0736          0.1300
                  KDM         0.1729          0.1992
                  CKD         0.2143          0.2806
MIRFlickr         CCA         0.5466          0.5477
                  KCCA        0.5521          0.5529
                  ml-CCA      0.5309          0.5302
                  CKD(α=0)    0.5602          0.5595
                  KDM         0.5951          0.5823
                  CKD         0.6103          0.5933
NUS-WIDE          CCA         0.3099          0.3103
                  KCCA        0.3088          0.3174
                  ml-CCA      0.2787          0.2801
                  CKD(α=0)    0.3170          0.3164
                  KDM         0.3247          0.3118
                  CKD         0.4149          0.4211
TABLE I: The MAP results on Pascal-Sentence, MIRFlickr and NUS-WIDE
Fig. 2: The convergence of Algorithm 1 on Pascal-Sentence (a), MIRFlickr (b), and NUS-WIDE (c).
Fig. 3: The CMC curves of I2T on Pascal-Sentence (a), MIRFlickr (b), and NUS-WIDE (c).
Fig. 4: The CMC curves of T2I on Pascal-Sentence (a), MIRFlickr (b), and NUS-WIDE (c).
Fig. 5: Performance variation of CKD with respect to λ_1 and λ_2 on Pascal-Sentence (a), MIRFlickr (b), and NUS-WIDE (c).
Fig. 6: Performance variation of CKD with respect to r on Pascal-Sentence (a), MIRFlickr (b), and NUS-WIDE (c).

III-E Parameter Sensitivity Analysis

In this section, we explore the impact of the parameters involved in the proposed model on the retrieval precision. As formulated in (7), the two parameters λ_1 and λ_2 control the weights of the two modalities respectively. We observe the performance variation by tuning λ_1 and λ_2 in the range {1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 1e2, 1e3, 1e4, 1e5}. Fig. 5 plots the performance of CKD on I2T and T2I as a function of λ_1 and λ_2. From Fig. 5, we can see that our model is more sensitive to λ_1 and λ_2 on Pascal-Sentence than on MIRFlickr and NUS-WIDE. In addition, for the dimension r of the Hilbert space, we carry out experiments on Pascal-Sentence, MIRFlickr and NUS-WIDE by changing the value of r in the range {10, 20, 30, 40, 50, 60}. Fig. 6 illustrates the MAP curves when raising r over this candidate range. As shown in Fig. 6, CKD achieves the best performance when r is set as 50, 50 and 60 on Pascal-Sentence, MIRFlickr and NUS-WIDE respectively.

III-F Complexity Analysis

In this section, we briefly discuss the complexity of the proposed CKD. The time complexity of Algorithm 1 mainly lies in updating P^(1) and P^(2) by eigenvalue decomposition on Φ^(1) and Φ^(2) respectively. In each iteration, the eigenvalue decomposition on Φ^(m) ∈ R^{d_m×d_m} (m = 1, 2) costs O(d_m³). Assuming the algorithm converges after T iterations, the total complexity of our model is O(T(d_1³ + d_2³)).

IV Conclusion

In this paper, we present a novel method which integrates kernel correlation maximization and discriminative structure preserving into a joint optimization framework. Our model, with its multiple uses of the label information, can facilitate a more discriminative subspace representation to realize cross-modal retrieval. Moreover, the ℓ2,1-norm constraint imposed on the projection matrices enables the model to extract more discriminative features and remove noisy ones. The experimental results on three publicly available datasets show that our approach is effective and outperforms several other classic subspace learning algorithms.




  • [1] Tian Q, Sebe N, Lew M S, et al. Image retrieval using wavelet-based salient points[J]. Journal of Electronic Imaging, 2001, 10(4):835-849.
  • [2] Ciocca G, Marini D, Rizzi A, et al. Retinex preprocessing of uncalibrated images for color-based image retrieval[J]. Journal of Electronic Imaging, 2003, 12(1):161-172.
  • [3] Shu X, Wu X J. A novel contour descriptor for 2D shape matching and its application to image retrieval[J]. Image and vision Computing, 2011, 29(4): 286-294.
  • [4] Zheng L, Wang S, Tian Q. Lp-Norm IDF for Scalable Image Retrieval[J]. IEEE Transactions on Image Processing, 2014, 23(8):3604-3617.
  • [5] Feng L, Yu L, Zhu H. Spectral embedding-based multiview features fusion for content-based image retrieval[J]. Journal of Electronic Imaging, 2017, 26(5): 053002.
  • [6] Wang K, He R, Wang L, et al. Joint feature selection and subspace learning for cross-modal retrieval[J]. IEEE transactions on pattern analysis and machine intelligence, 2016, 38(10): 2010-2023.
  • [7] S. Akaho, A kernel method for canonical correlation analysis, in: Proceedings of the International Meeting of the Psychometric Society, 2007.
  • [8] Sharma A, Jacobs D W. Bypassing synthesis: PLS for face recognition with pose, low-resolution and sketch[C]//CVPR 2011. IEEE, 2011: 593-600.

  • [9] Tenenbaum J B, Freeman W T. Separating style and content with bilinear models[J]. Neural computation, 2000, 12(6): 1247-1283.
  • [10] Kim T K, Kittler J, Cipolla R. Discriminative learning and recognition of image set classes using canonical correlations[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6): 1005-1018.
  • [11] Rasiwasia N, Costa Pereira J, Coviello E, et al. A new approach to cross-modal multimedia retrieval[C]//Proceedings of the 18th ACM international conference on Multimedia. ACM, 2010: 251-260.
  • [12] Wang W, Arora R, Livescu K, et al. On deep multi-view representation learning[C]//International Conference on Machine Learning. 2015: 1083-1092.

  • [13] Andrew G, Arora R, Bilmes J, et al. Deep canonical correlation analysis[C]//International conference on machine learning. 2013: 1247-1255.
  • [14] Gong Y, Ke Q, Isard M, et al. A Multi-View Embedding Space for Modeling Internet Images, Tags, and their Semantics[J]. International Journal of Computer Vision, 2012, 106(2):210-233.

  • [15] Jacobs D W , Daume H , Kumar A , et al. Generalized Multiview Analysis: A discriminative latent space[C]// 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2012.
  • [16] Lin D, Tang X. Inter-modality face recognition[C]//European conference on computer vision. Springer, Berlin, Heidelberg, 2006: 13-26.
  • [17] Nie F, Huang H, Cai X, et al. Efficient and Robust Feature Selection via Joint ℓ2,1-Norms Minimization[C]//Advances in Neural Information Processing Systems 23 (NIPS 2010). Curran Associates Inc., 2010.
  • [18] Xu M, Zhu Z, Zhao Y, et al. Subspace learning by kernel dependence maximization for cross-modal retrieval[J]. Neurocomputing, 2018, 309: 94-105.
  • [19] Davis J V, Kulis B, Jain P, et al. Information-theoretic metric learning[C]//Proceedings of the 24th international conference on Machine learning. ACM, 2007: 209-216.
  • [20] Principe J C. Information theory, machine learning, and reproducing kernel Hilbert spaces[M]//Information theoretic learning. Springer, New York, NY, 2010: 1-45.
  • [21] Song T , Cai J , Zhang T , et al. Semi-supervised manifold-embedded hashing with joint feature representation and classifier learning[J]. Pattern Recognition, 2017, 68:99-110.
  • [22] Wang D , Wang Q , Gao X . Robust and Flexible Discrete Hashing for Cross-Modal Similarity Search[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2017:1-1.
  • [23] Nie F, Huang H, Cai X, et al. Efficient and robust feature selection via joint ℓ2, 1-norms minimization[C]//Advances in neural information processing systems. 2010: 1813-1821.
  • [24] Wei Y, Zhao Y, Lu C, et al. Cross-modal retrieval with CNN visual features: A new baseline[J]. IEEE transactions on cybernetics, 2017, 47(2): 449-460.
  • [25] Huiskes M J, Lew M S. The MIR flickr retrieval evaluation[C]//Proceedings of the 1st ACM international conference on Multimedia information retrieval. ACM, 2008: 39-43.
  • [26] Chua T S, Tang J, Hong R, et al. NUS-WIDE: a real-world web image database from National University of Singapore[C]//Proceedings of the ACM international conference on image and video retrieval. ACM, 2009: 48.
  • [27] Hardoon D R, Szedmak S, Shawe-Taylor J. Canonical correlation analysis: An overview with application to learning methods[J]. Neural computation, 2004, 16(12): 2639-2664.
  • [28] Lisanti G, Masi I, Del Bimbo A. Matching people across camera views using kernel canonical correlation analysis[C]//Proceedings of the International Conference on Distributed Smart Cameras. ACM, 2014: 10.
  • [29] Ranjan V, Rasiwasia N, Jawahar C V. Multi-label cross-modal retrieval[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 4094-4102.