I Introduction
Recently, the fast development of the Internet and the explosive growth of multimedia data, including text, image, video and audio, have greatly enriched people's lives but magnified the challenge of information retrieval. Representative image retrieval methods, such as wavelet-based salient points [1], color-based image retrieval [2], the Contour Points Distribution Histogram (CPDH) [3], Inverse Document Frequency (IDF) [4], and content-based image retrieval [5], cannot be directly applied in applications with multiple modalities. Multimodal data refers to data of different types that share the same content information, for example, video clips and music, or photos and tweets recording a concert. Cross-modal retrieval, which aims to take one type of data as a query to retrieve the relevant data objects of another type, has attracted much attention. Cross-modal retrieval needs to solve two basic problems: one is how to measure the relevance between heterogeneous modalities, and the other is how to learn a discriminative feature representation for each modality. For the first problem, most methods learn a latent common subspace where all modalities are comparable [6]. These approaches, which are dubbed subspace learning models, include unsupervised approaches [7][8][9], supervised approaches [10][11], deep neural network based approaches [12][13], etc. Common space learning approaches have brought great improvements to cross-modal retrieval tasks. However, most of them force all modalities into one designed low-dimensional subspace, which results in information loss because the modalities constrain each other. Therefore, in this paper, we learn a subspace for each modality separately based on Hilbert space theory. To solve the second problem mentioned above, many models [14][15][16] employ label information to enhance the discriminative ability of the learned features. However, most of them incorporate discriminative label information at only one (often the final) step of the model. In fact, discriminative information may be lost at every stage from the input to the output of a model.
In this paper, we regard the label space as a modality and measure the correlations among the kernel matrices of multiple modalities based on the Hilbert-Schmidt independence criterion. At the same time, we use the label vectors to construct a similarity graph over the multimodal samples. Our model preserves the semantic structure within each modality by virtue of the constructed similarity graph during subspace learning. The proposed model not only considers the inter-modality correlation relationship, but also preserves the intra-modality similarity relationship. Besides, inspired by [17], an $\ell_{2,1}$-norm constraint is imposed on the projection matrices separately to select discriminative features for each modality. The main contributions of our work are summarized as follows: (1) The proposed learning model combines subspace learning, feature selection and semantic structure preservation into a joint framework. In addition, the convergence of the algorithm is analyzed theoretically.
(2) Supervised label information is employed to explore the intra-modality similarity among data and the inter-modality correlation simultaneously.
(3) Comprehensive experiments are carried out on three widely used datasets, and the experimental results show the superiority and effectiveness of our model.
Structurally, the rest of this paper falls into three parts. The CKD model, its joint optimization process and the convergence analysis are described in Section II. The experimental results and analysis are presented in Section III, and finally we draw the conclusions of the paper in Section IV.
II Our Method
II-A Problem Formulation
Assume that there are $m$ different modalities, denoted as $\{X^{(v)}\}_{v=1}^{m}$. The $v$-th modality $X^{(v)} \in \mathbb{R}^{d_v \times n}$ contains $n$ samples with dimension $d_v$. The classification label matrix is denoted by $Y \in \{0,1\}^{n \times c}$, where $c$ is the number of categories. $Y_{ij} = 1$ if the $i$-th sample belongs to the $j$-th class; $Y_{ij} = 0$ otherwise. Without loss of generality, the samples are zero-centered for each modality, i.e. $\sum_{i=1}^{n} x_i^{(v)} = 0$ ($v = 1, \ldots, m$).
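As a small illustration of the preprocessing above (our own Python sketch, not the paper's MATLAB code), zero-centering a modality matrix stored with samples as columns amounts to subtracting each feature's mean:

```python
import numpy as np

def zero_center(X):
    """Zero-center a modality matrix X of shape (d_v, n):
    subtract each feature's mean so the n sample columns sum to zero."""
    return X - X.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X1 = rng.normal(size=(4, 6))   # toy modality: 4 features, 6 samples
X1c = zero_center(X1)
print(np.allclose(X1c.sum(axis=1), 0.0))  # prints True
```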
II-B Hilbert-Schmidt Independence Criterion
Suppose that $\phi(x)$ and $\psi(y)$ are two mapping functions. The associated positive definite kernels $k_1$ and $k_2$ are formulated as $k_1(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ and $k_2(y_i, y_j) = \langle \psi(y_i), \psi(y_j) \rangle$ respectively. Given $n$ paired data samples $\{(x_i, y_i)\}_{i=1}^{n}$, the empirical expression of the Hilbert-Schmidt independence criterion (HSIC) [18][19][20] is defined as

$\mathrm{HSIC}(X, Y) = (n-1)^{-2}\,\mathrm{tr}(K_1 H K_2 H)$   (1)

where $H = I - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^T$ is a centering matrix, and $K_1$, $K_2$ are the kernel matrices with $(K_1)_{ij} = k_1(x_i, x_j)$ and $(K_2)_{ij} = k_2(y_i, y_j)$.
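For concreteness, the empirical estimator in Eq. (1) can be sketched in a few lines of Python (our own illustration with toy linear kernels, not the paper's code):

```python
import numpy as np

def hsic(K1, K2):
    """Empirical HSIC of Eq. (1): (n-1)^{-2} tr(K1 H K2 H)."""
    n = K1.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K1 @ H @ K2 @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Z = rng.normal(size=(20, 5))              # drawn independently of X
Kx, Ky, Kz = X @ X.T, (2 * X) @ (2 * X).T, Z @ Z.T  # linear kernels

# A deterministically dependent pair scores much higher than an
# independent one, and HSIC is linear in each kernel argument:
print(hsic(Kx, Ky) > hsic(Kx, Kz))        # prints True
print(np.isclose(hsic(Kx, Ky), 4 * hsic(Kx, Kx)))  # prints True
```

Because HSIC is linear in each kernel argument, scaling one kernel by 4 scales the score by exactly 4, which the last line checks.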
II-C Model
For simplicity, we discuss our algorithm based on two modalities, i.e. image and text. It is easy to extend it to cases with more modalities.
II-C1 Kernel Dependence Maximization
Different Hilbert spaces with the same dimension are isomorphic, so we can conduct cross-modal retrieval tasks based on Hilbert space theory. The semantic labels shared by multiple modalities can be regarded as the data of an additional modality. The correlations among multimodal data can then be measured by calculating the similarity among multiple kernel matrices, and HSIC is introduced to calculate this similarity in our model. The kernel matrix of the $v$-th modality is $K_v = Z_v Z_v^T$, where $Z_v = X^{(v)T} P_v$ ($v = 1, 2$) and $P_v \in \mathbb{R}^{d_v \times d}$ is the projection matrix of the $v$-th modality. $K_L$ denotes the kernel matrix of the semantic labels. According to the definition (1), the objective can be given as follows:

$\max_{P_1, P_2}\ \mathrm{tr}(K_1 H K_2 H) + \mathrm{tr}(K_1 H K_L H) + \mathrm{tr}(K_2 H K_L H)$   (2)

where $K_L = Y Y^T$. The correlation among different modalities is, in essence, the inter-modality similarity relationship.
II-C2 Discriminative Structure Preserving
Although different modalities lie in isomeric spaces, they share the same semantic labels. We construct a cosine similarity graph by leveraging the semantic label vectors. Specifically, the similarity between the $i$-th and the $j$-th entity, each a paired image and text, is defined as follows:

$S_{ij} = \frac{y_i^T y_j}{\|y_i\|\,\|y_j\|}$   (3)

where $\|y_i\|$ denotes the $\ell_2$ norm of the label vector $y_i$. The similarity calculated by equation (3) is consistent across modalities. We hope to preserve the similarity relationships among objects within each single modality in the Hilbert space, which can be viewed as the intra-modality similarity relationship. According to the constructed graph, the objective function can be formulated as follows

$\min_{P_1, P_2}\ \alpha\,\mathrm{tr}(P_1^T X^{(1)} L_S X^{(1)T} P_1) + \beta\,\mathrm{tr}(P_2^T X^{(2)} L_S X^{(2)T} P_2)$   (4)

where $L_S = D_S - S$ is the graph Laplacian matrix ($D_S$ is the diagonal degree matrix with $(D_S)_{ii} = \sum_j S_{ij}$), and $\alpha$ and $\beta$ are two adjustable parameters. As discussed in the literature [21][22], learning frameworks with an $\ell_{2,1}$-norm constraint have the advantages of feature selection, sparsity, and robustness to noise. In our model, we impose the $\ell_{2,1}$-norm constraint on the projection matrices to learn more discriminative features and remove redundant ones. The objective function (4) can be rewritten as follows

$\min_{P_1, P_2}\ \alpha\,\mathrm{tr}(P_1^T X^{(1)} L_S X^{(1)T} P_1) + \beta\,\mathrm{tr}(P_2^T X^{(2)} L_S X^{(2)T} P_2) + \lambda_1 \|P_1\|_{2,1} + \lambda_2 \|P_2\|_{2,1}$   (5)

where $\lambda_1$ and $\lambda_2$ are two trade-off parameters.
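The label-driven graph of Eq. (3) and the Laplacian used in Eqs. (4)-(5) can be illustrated as follows (a Python sketch on our own toy data; the paper's experiments use MATLAB). It also checks the standard identity that the Laplacian trace term equals the similarity-weighted pairwise squared distances between projected samples:

```python
import numpy as np

# Label matrix Y (n x c): one-hot class indicators for 4 samples, 3 classes.
Y = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)

# Cosine similarity between label vectors, Eq. (3).
norms = np.linalg.norm(Y, axis=1, keepdims=True)
S = (Y @ Y.T) / (norms @ norms.T)

# Graph Laplacian of the structure-preserving term: L_S = D_S - S.
L_S = np.diag(S.sum(axis=1)) - S

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))          # d=5 features, n=4 samples (columns)
P = rng.normal(size=(5, 2))          # projection to a 2-dim subspace
Z = P.T @ X                          # projected samples as columns

# tr(P^T X L_S X^T P) == (1/2) sum_ij S_ij ||z_i - z_j||^2
lhs = np.trace(P.T @ X @ L_S @ X.T @ P)
rhs = 0.5 * sum(S[i, j] * np.sum((Z[:, i] - Z[:, j]) ** 2)
                for i in range(4) for j in range(4))
print(np.isclose(lhs, rhs))  # prints True
```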
We integrate kernel dependence maximization and discriminative structure preserving into a joint framework by combining (2) and (5). The overall objective function can be written as follows

$\min_{P_1, P_2}\ \alpha\,\mathrm{tr}(P_1^T X^{(1)} L_S X^{(1)T} P_1) + \beta\,\mathrm{tr}(P_2^T X^{(2)} L_S X^{(2)T} P_2) + \lambda_1 \|P_1\|_{2,1} + \lambda_2 \|P_2\|_{2,1} - \gamma\,\big[\mathrm{tr}(K_1 H K_2 H) + \mathrm{tr}(K_1 H K_L H) + \mathrm{tr}(K_2 H K_L H)\big]$   (6)

where $\gamma$ is an adjustable parameter. We usually set $\gamma = 1$ for simplicity. If $\gamma = 0$, the model only considers the intra-modality relationship and ignores the inter-modality relationship.
II-D Optimization
To optimize all variables conveniently, we transform the $\ell_{2,1}$-norm terms into trace form by introducing an intermediate diagonal matrix $D_v$ with $(D_v)_{ii} = \frac{1}{2\|p_v^i\|_2}$, where $p_v^i$ is the $i$-th row of $P_v$. Besides, the scaling factor $(n-1)^{-2}$ of HSIC is set to 1. The objective function (6) can be rewritten as

$\min_{P_1, P_2}\ \alpha\,\mathrm{tr}(P_1^T X^{(1)} L_S X^{(1)T} P_1) + \beta\,\mathrm{tr}(P_2^T X^{(2)} L_S X^{(2)T} P_2) + \lambda_1\,\mathrm{tr}(P_1^T D_1 P_1) + \lambda_2\,\mathrm{tr}(P_2^T D_2 P_2) - \gamma\,\big[\mathrm{tr}(K_1 H K_2 H) + \mathrm{tr}(K_1 H K_L H) + \mathrm{tr}(K_2 H K_L H)\big], \quad \text{s.t. } P_1^T P_1 = I,\ P_2^T P_2 = I$   (7)

where $D_1$ and $D_2$ are computed from the current $P_1$ and $P_2$ and treated as constants within each iteration.
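The $\ell_{2,1}$-to-trace reformulation follows the device of [17][23]: when $D$ is computed from $P$ itself, $\mathrm{tr}(P^T D P) = \frac{1}{2}\|P\|_{2,1}$. A small Python check of this identity (our own sketch; the `eps` guard against zero rows is an implementation detail we add):

```python
import numpy as np

def l21_norm(P):
    """||P||_{2,1}: sum of the l2 norms of the rows of P."""
    return np.linalg.norm(P, axis=1).sum()

def d_matrix(P, eps=1e-12):
    """Diagonal auxiliary matrix with D_ii = 1 / (2 ||p^i||_2),
    where p^i is the i-th row of P (eps guards against zero rows)."""
    return np.diag(1.0 / (2.0 * np.linalg.norm(P, axis=1) + eps))

rng = np.random.default_rng(0)
P = rng.normal(size=(6, 3))
D = d_matrix(P)
# With D computed from P itself, tr(P^T D P) = (1/2) ||P||_{2,1}:
print(np.isclose(np.trace(P.T @ D @ P), 0.5 * l21_norm(P)))  # prints True
```

In the alternating algorithm $D$ is held at its previous-iterate value while $P$ is updated, which is what makes each subproblem a plain trace optimization.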
II-D1 Optimization of $P_1$
Keeping only the terms related to $P_1$ and using the cyclic property of the trace, we can obtain

$\max_{P_1^T P_1 = I}\ \mathrm{tr}(P_1^T M_1 P_1)$   (8)

where $M_1 = \gamma X^{(1)}(H K_2 H + H K_L H)X^{(1)T} - \alpha X^{(1)} L_S X^{(1)T} - \lambda_1 D_1$. The solution of (8), which consists of the first $d$ eigenvectors corresponding to the $d$ largest eigenvalues of $M_1$, can be obtained by performing eigenvalue decomposition on $M_1$.
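The eigenvector-based subproblem solution can be sketched generically as follows (a Python illustration with a random symmetric matrix standing in for $M_1$): the orthonormal $P$ built from the top-$d$ eigenvectors attains $\mathrm{tr}(P^T M P)$ equal to the sum of the $d$ largest eigenvalues:

```python
import numpy as np

def top_eigvecs(M, d):
    """Columns = eigenvectors of symmetric M for its d largest eigenvalues."""
    vals, vecs = np.linalg.eigh(M)              # eigenvalues ascending
    return vecs[:, np.argsort(vals)[::-1][:d]]  # reorder to descending

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
M = (A + A.T) / 2                    # symmetrize the stand-in for M_1
P = top_eigvecs(M, 3)
# The maximum of tr(P^T M P) over orthonormal P is the sum of the
# 3 largest eigenvalues of M:
top3 = np.sort(np.linalg.eigvalsh(M))[-3:].sum()
print(np.isclose(np.trace(P.T @ M @ P), top3))  # prints True
```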
II-D2 Optimization of $P_2$
Keeping only the terms related to $P_2$, we can obtain

$\max_{P_2^T P_2 = I}\ \mathrm{tr}(P_2^T M_2 P_2)$   (9)

where $M_2 = \gamma X^{(2)}(H K_1 H + H K_L H)X^{(2)T} - \beta X^{(2)} L_S X^{(2)T} - \lambda_2 D_2$. Similar to $P_1$, the solution of $P_2$ can be obtained by performing eigenvalue decomposition on $M_2$.
II-E Convergence Analysis
The detailed optimization procedure is summarized in Algorithm 1. The process is repeated until the algorithm converges. The convergence curves on NUS-WIDE, Pascal Sentence and MIRFlickr-25K are plotted in Fig. 3, from which we can see that our method converges quickly.
The objective function (7), optimized according to the updating rules in Algorithm 1, decreases monotonically, and it converges to the global minimum value.
Given any nonzero vectors $p$ and $q$, the following inequality (10) holds. Please refer to [23] for details.

$\|p\|_2 - \frac{\|p\|_2^2}{2\|q\|_2} \le \|q\|_2 - \frac{\|q\|_2^2}{2\|q\|_2}$   (10)
In order to prove the claim above, we make use of inequality (10). The detailed proof is given below.
The optimizations of $P_1$ and $P_2$ are symmetric in Algorithm 1, so it suffices to prove the claim for one of them. The detailed proof for the optimization of $P_1$ is provided below. With $P_2$ fixed, the optimization problem is written as follows

$\min_{P_1^T P_1 = I}\ f(P_1) + \lambda_1\,\mathrm{tr}(P_1^T D_1 P_1)$   (11)

Letting $f(P_1) = \alpha\,\mathrm{tr}(P_1^T X^{(1)} L_S X^{(1)T} P_1) - \gamma\,\mathrm{tr}\big(P_1^T X^{(1)}(H K_2 H + H K_L H) X^{(1)T} P_1\big)$, problem (11) collects all the terms of (7) that depend on $P_1$. For the $t$-th iteration,

$f(P_1^{(t+1)}) + \lambda_1\,\mathrm{tr}\big((P_1^{(t+1)})^T D_1^{(t)} P_1^{(t+1)}\big) \le f(P_1^{(t)}) + \lambda_1\,\mathrm{tr}\big((P_1^{(t)})^T D_1^{(t)} P_1^{(t)}\big)$   (12)

By virtue of (10), we can obtain

$\sum_i \Big(\|p_1^{i,(t+1)}\|_2 - \frac{\|p_1^{i,(t+1)}\|_2^2}{2\|p_1^{i,(t)}\|_2}\Big) \le \sum_i \Big(\|p_1^{i,(t)}\|_2 - \frac{\|p_1^{i,(t)}\|_2^2}{2\|p_1^{i,(t)}\|_2}\Big)$   (13)

Adding $\lambda_1$ times (13) to (12) and noting that $\mathrm{tr}\big(P_1^T D_1^{(t)} P_1\big) = \sum_i \frac{\|p_1^i\|_2^2}{2\|p_1^{i,(t)}\|_2}$, we have

$f(P_1^{(t+1)}) + \lambda_1 \|P_1^{(t+1)}\|_{2,1} \le f(P_1^{(t)}) + \lambda_1 \|P_1^{(t)}\|_{2,1}$   (14)
Similarly, we can prove the corresponding inequality for the optimization of $P_2$. Then we have that the objective value of (7) is non-increasing across iterations. Therefore, the proposed model based on the updating rules in Algorithm 1 decreases monotonically. Since the optimization problem is convex, the objective function finally converges to the global optimal solution.
III Experiments
In this section, we conduct experiments on three benchmark datasets, i.e. Pascal Sentence [24], MIRFlickr [25] and NUS-WIDE. Two evaluation protocols are introduced to evaluate our algorithm, and comparisons with several classic cross-modal methods are also presented below.
III-A Datasets
Pascal Sentence:
consists of 1000 samples, each with an image and text. Each image is described by several sentences. The dataset is divided into 20 categories and each category includes 50 samples. We randomly select 30 samples from each category as the training set and the rest as the testing set. For each image, we employ a convolutional neural network to extract 4096-dimensional CNN features. For the text features, we utilize an LDA model to get the probability of each topic, and the 100-dimensional topic-probability vector is used to represent the text.
MIRFlickr:
contains 25,000 images originally crawled from the Flickr website. Each image and its associated textual tags are called an instance. Each instance is manually classified into some of 24 classes. We only keep those instances whose textual tags appear at least 20 times, and remove those instances without annotated labels or any textual tags. Each image is represented by a 150-dimensional edge histogram vector, and each text is represented as a 500-dimensional vector derived from PCA on its bag-of-words vector. We randomly select 5% of the instances as the query set and 30% of the instances as the training set.
NUS-WIDE: This dataset is a subset sampled from a real-world web image dataset [26] including 190,420 image-text pairs with 21 possible labels. For each pair, the image is represented by 500-dimensional SIFT BoVW features, and the text by 1000-dimensional text annotations. The subset we use contains 8,687 image-text pairs, divided into two parts: 5,212 pairs for training and 3,475 pairs for testing.
III-B Experimental Setting
The CKD model proposed in this paper is a supervised, kernel-based and correlation-based method. We compare our algorithm with the following methods: a correlation-based method (CCA) [27]; a kernel-based and correlation-based method (KCCA) [28]; a supervised and correlation-based method (mlCCA) [29]; and a supervised, kernel-based and correlation-based method (KDM) [18]. Besides, we compare our algorithm with the variant of itself in which $\gamma$ is set to zero. We tune the subspace dimension $d$ in the range {10, 20, 30, 40, 50, 60}. $\lambda_1$ and $\lambda_2$ are the two coefficients of the regularization term, and they are fixed as 0.01 in the experiments. $\alpha$ and $\beta$ are the trade-off parameters in the objective function. We set the possible values of $\alpha$ and $\beta$ in the range {1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 1e2, 1e3, 1e4, 1e5} empirically. The best results are reported in this paper. Our experiments are implemented in MATLAB 2016b on a Windows 10 (64-bit) desktop machine with 12 GB of memory and a 4-core 3.6 GHz Intel(R) Core(TM) i7-7700 CPU.
III-C Evaluation Protocol
There are many evaluation metrics in the information retrieval area. We introduce two commonly used indicators, i.e. the Cumulative Match Characteristic (CMC) curve with the rank-{5, 10, 15, 20, 25, 30} accuracies, and the Mean Average Precision (MAP). For a query, the Average Precision (AP) is defined as Eq. (15):

$AP = \frac{1}{R}\sum_{k=1}^{n} \frac{R_k}{k}\,\mathrm{rel}(k)$   (15)

where $\mathrm{rel}(k) = 1$ if the result at position $k$ is correct and $\mathrm{rel}(k) = 0$ otherwise; $R_k$ represents the number of correct results among the top $k$ retrieved items, and $R$ is the total number of correct results for the query. The average of the AP values over all queries is called the MAP; a larger value indicates better performance.
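A minimal Python sketch of AP and MAP as defined in Eq. (15) (our own illustration, not the evaluation code used in the experiments):

```python
import numpy as np

def average_precision(relevance):
    """AP for one query, Eq. (15). `relevance` is the 0/1 sequence rel(k)
    over the ranked retrieval results."""
    rel = np.asarray(relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    ranks = np.arange(1, len(rel) + 1)
    precision_at_k = np.cumsum(rel) / ranks   # R_k / k at each position
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(all_relevance):
    """MAP: mean of the per-query AP values."""
    return float(np.mean([average_precision(r) for r in all_relevance]))

# Two toy queries: relevant items at ranks 1 and 3, and at rank 2.
print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 = 0.8333...
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 0, 0]]))
```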
CMC gives the probability that a true retrieval result appears in candidate lists of different sizes. Specifically, if the retrieval results contain one or more objects classified into the same class as the query data, we consider that the query matches a true object. Assuming that the length of the retrieval result list is fixed as $k$, the rate of true matches over all queries is denoted CMC@$k$.
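CMC@$k$ can likewise be sketched in a few lines (our own illustration; `ranked_labels` and `query_labels` are hypothetical inputs holding class ids):

```python
import numpy as np

def cmc_at_k(ranked_labels, query_labels, k):
    """CMC@k: fraction of queries whose top-k ranked results contain at
    least one object of the query's class. `ranked_labels[q]` is the
    class sequence of query q's ranked retrieval results."""
    hits = [query_labels[q] in ranked_labels[q][:k]
            for q in range(len(query_labels))]
    return float(np.mean(hits))

# Two queries of classes 0 and 1; class ids of their ranked results:
ranked = [[2, 0, 1], [2, 2, 2]]
print(cmc_at_k(ranked, [0, 1], 1))  # 0.0: neither query matches at rank 1
print(cmc_at_k(ranked, [0, 1], 2))  # 0.5: query 0 matches within top 2
```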
III-D Results
According to the above settings and protocols, we perform two typical retrieval tasks: image query on the text database and text query on the image database, abbreviated as 'I2T' and 'T2I' respectively. I2T takes the image modality as the query and the text modality as the retrieval objects, while T2I takes the text modality as the query and the image modality as the retrieval objects. Table 1 shows the MAP results on Pascal Sentence, MIRFlickr and NUS-WIDE. From the results reported in Table 1, we can observe that CKD outperforms the other comparative methods. KDM learns the subspace by maximizing kernel correlation, while CKD($\gamma$=0) only preserves the structural consistency when changing the data space of each modality. Obviously, CKD is superior to both KDM and CKD($\gamma$=0) in terms of retrieval precision, which manifests that CKD, by integrating the two, improves the retrieval performance.
Besides, to further validate the effectiveness of CKD, we conduct experiments on the basis of CMC. Fig. 3 and Fig. 4 show the performance variation of all approaches with respect to different-sized candidate lists for I2T and T2I respectively. They show that our model achieves the best performance among all approaches.
Overall, the above experimental results on Pascal Sentence, MIRFlickr and NUS-WIDE indicate that the CKD model proposed in this paper is effective for cross-modal retrieval.
Table 1: MAP results on Pascal Sentence, MIRFlickr and NUS-WIDE.

Datasets         Approaches   Image as query   Text as query
Pascal Sentence  CCA          0.0501           0.0456
                 KCCA         0.0376           0.0402
                 mlCCA        0.0422           0.0329
                 CKD(γ=0)     0.0736           0.1300
                 KDM          0.1729           0.1992
                 CKD          0.2143           0.2806
MIRFlickr        CCA          0.5466           0.5477
                 KCCA         0.5521           0.5529
                 mlCCA        0.5309           0.5302
                 CKD(γ=0)     0.5602           0.5595
                 KDM          0.5951           0.5823
                 CKD          0.6103           0.5933
NUS-WIDE         CCA          0.3099           0.3103
                 KCCA         0.3088           0.3174
                 mlCCA        0.2787           0.2801
                 CKD(γ=0)     0.3170           0.3164
                 KDM          0.3247           0.3118
                 CKD          0.4149           0.4211
III-E Parameter Sensitivity Analysis
In this section, we explore the impact of the parameters of the proposed model on retrieval precision. As formulated in (7), the two parameters $\alpha$ and $\beta$ control the weights of the two modalities respectively. We observe the performance variation by tuning the values of $\alpha$ and $\beta$ in the range {1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 1e2, 1e3, 1e4, 1e5}. Fig. 5 plots the performance of CKD on I2T and T2I as a function of $\alpha$ and $\beta$. From Fig. 5, we can see that our model is more sensitive to $\alpha$ and $\beta$ on Pascal Sentence than on MIRFlickr and NUS-WIDE. In addition, for the dimension $d$ of the Hilbert space, we carry out experiments on Pascal Sentence, MIRFlickr and NUS-WIDE by changing the value of $d$ in the range {10, 20, 30, 40, 50, 60}. Fig. 6 illustrates the MAP curves as $d$ increases over the candidate range. As shown in Fig. 6, CKD achieves the best performance when $d$ is set as 50, 50 and 60 on Pascal Sentence, MIRFlickr and NUS-WIDE respectively.
III-F Complexity Analysis
In this section, we briefly discuss the complexity of the proposed CKD. The time cost of Algorithm 1 lies mainly in updating $P_1$ and $P_2$ by eigenvalue decomposition on $M_1$ and $M_2$ respectively. In each iteration, the eigenvalue decomposition on $M_v$ ($v = 1, 2$) costs $O(d_v^3)$. Assuming the algorithm converges after $T$ iterations, the total complexity of our model is $O(T(d_1^3 + d_2^3))$.
IV Conclusion
In this paper, we present a novel method that integrates kernel correlation maximization and discriminative structure preserving into a joint optimization framework. By making multiple uses of the label information, our model facilitates more discriminative subspace representations for cross-modal retrieval. Moreover, the $\ell_{2,1}$-norm constraint imposed on the projection matrices enables the model to extract more discriminative features and remove noisy ones. The experimental results on three publicly available datasets show that our approach is effective and outperforms several other classic subspace learning algorithms.
Acknowledgment
The paper is supported by the National Natural Science Foundation of China (Grant No. 61373055, 61672265), MURI/EPSRC/DSTL Grant EP/R018456/1, and the 111 Project of Ministry of Education of China (Grant No. B12018).
References
[1] Tian Q, Sebe N, Lew M S, et al. Image retrieval using wavelet-based salient points[J]. Journal of Electronic Imaging, 2001, 10(4): 835-849.
[2] Ciocca G, Marini D, Rizzi A, et al. Retinex preprocessing of uncalibrated images for color-based image retrieval[J]. Journal of Electronic Imaging, 2003, 12(1): 161-172.
[3] Shu X, Wu X J. A novel contour descriptor for 2D shape matching and its application to image retrieval[J]. Image and Vision Computing, 2011, 29(4): 286-294.
[4] Zheng L, Wang S, Tian Q. Lp-norm IDF for scalable image retrieval[J]. IEEE Transactions on Image Processing, 2014, 23(8): 3604-3617.
[5] Feng L, Yu L, Zhu H. Spectral embedding-based multi-view features fusion for content-based image retrieval[J]. Journal of Electronic Imaging, 2017, 26(5): 053002.
[6] Wang K, He R, Wang L, et al. Joint feature selection and subspace learning for cross-modal retrieval[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(10): 2010-2023.
[7] Akaho S. A kernel method for canonical correlation analysis[C]//Proceedings of the International Meeting of the Psychometric Society, 2007.
[8] Sharma A, Jacobs D W. Bypassing synthesis: PLS for face recognition with pose, low-resolution and sketch[C]//CVPR 2011. IEEE, 2011: 593-600.
[9] Tenenbaum J B, Freeman W T. Separating style and content with bilinear models[J]. Neural Computation, 2000, 12(6): 1247-1283.
[10] Kim T K, Kittler J, Cipolla R. Discriminative learning and recognition of image set classes using canonical correlations[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6): 1005-1018.
[11] Rasiwasia N, Costa Pereira J, Coviello E, et al. A new approach to cross-modal multimedia retrieval[C]//Proceedings of the 18th ACM International Conference on Multimedia. ACM, 2010: 251-260.
[12] Wang W, Arora R, Livescu K, et al. On deep multi-view representation learning[C]//International Conference on Machine Learning. 2015: 1083-1092.
[13] Andrew G, Arora R, Bilmes J, et al. Deep canonical correlation analysis[C]//International Conference on Machine Learning. 2013: 1247-1255.
[14] Gong Y, Ke Q, Isard M, et al. A multi-view embedding space for modeling Internet images, tags, and their semantics[J]. International Journal of Computer Vision, 2012, 106(2): 210-233.
[15] Jacobs D W, Daume H, Kumar A, et al. Generalized multiview analysis: A discriminative latent space[C]//2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2012.
[16] Lin D, Tang X. Inter-modality face recognition[C]//European Conference on Computer Vision. Springer, Berlin, Heidelberg, 2006: 13-26.
[17] Nie F, Huang H, Cai X, et al. Efficient and robust feature selection via joint ℓ2,1-norms minimization[C]//Advances in Neural Information Processing Systems 23. Curran Associates Inc., 2010: 1813-1821.
[18] Xu M, Zhu Z, Zhao Y, et al. Subspace learning by kernel dependence maximization for cross-modal retrieval[J]. Neurocomputing, 2018, 309: 94-105.
[19] Davis J V, Kulis B, Jain P, et al. Information-theoretic metric learning[C]//Proceedings of the 24th International Conference on Machine Learning. ACM, 2007: 209-216.
[20] Principe J C. Information theory, machine learning, and reproducing kernel Hilbert spaces[M]//Information Theoretic Learning. Springer, New York, NY, 2010: 1-45.
[21] Song T, Cai J, Zhang T, et al. Semi-supervised manifold-embedded hashing with joint feature representation and classifier learning[J]. Pattern Recognition, 2017, 68: 99-110.
[22] Wang D, Wang Q, Gao X. Robust and flexible discrete hashing for cross-modal similarity search[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[23] Nie F, Huang H, Cai X, et al. Efficient and robust feature selection via joint ℓ2,1-norms minimization[C]//Advances in Neural Information Processing Systems. 2010: 1813-1821.
[24] Wei Y, Zhao Y, Lu C, et al. Cross-modal retrieval with CNN visual features: A new baseline[J]. IEEE Transactions on Cybernetics, 2017, 47(2): 449-460.
[25] Huiskes M J, Lew M S. The MIR Flickr retrieval evaluation[C]//Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval. ACM, 2008: 39-43.
[26] Chua T S, Tang J, Hong R, et al. NUS-WIDE: A real-world web image database from National University of Singapore[C]//Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, 2009: 48.
[27] Hardoon D R, Szedmak S, Shawe-Taylor J. Canonical correlation analysis: An overview with application to learning methods[J]. Neural Computation, 2004, 16(12): 2639-2664.
[28] Lisanti G, Masi I, Del Bimbo A. Matching people across camera views using kernel canonical correlation analysis[C]//Proceedings of the International Conference on Distributed Smart Cameras. ACM, 2014: 10.
[29] Ranjan V, Rasiwasia N, Jawahar C V. Multi-label cross-modal retrieval[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 4094-4102.