I. Introduction
Nearest neighbor (NN) search plays a fundamental role in machine learning and information retrieval. Cross-modal retrieval, an application built on nearest neighbor search, has attracted much research attention recently. Multimedia data naturally have multiple modalities, and these modalities may carry correlated semantic information, such as video-tag pairs on YouTube and image-text pairs on Flickr. Cross-modal retrieval returns relevant results of one modality for a given query of another modality. For example, we can use text queries to retrieve images, and image queries to retrieve texts. This retrieval paradigm provides a useful interface for users to search data across different modalities.
With the rapid growth of multimedia data, exhaustive search, which consumes tremendous computational resources on a large-scale dataset, is impractical. To address this issue, existing cross-modal retrieval methods mainly leverage the hashing technique to generate compact data representations. The goal of hashing is to embed data points from the original space into a Hamming space as binary hash codes. It generally exploits inter/intra-class correlations or the underlying data distribution/manifold to learn a set of hash functions, so that similar data points receive similar binary codes. Hamming distance computation between binary codes enables fast nearest neighbor search through hardware-supported bit operations with minimal memory consumption. However, hashing faces the critical problem of quantization loss after binary embedding. Even though hashing algorithms have proposed various learning strategies to reduce this loss, an inevitably large information gap remains between a real-valued vector and the corresponding binary code. Searching nearest neighbors in the binary Hamming space is therefore less accurate than searching in the real-valued Euclidean space.
In this paper, we propose a novel index scheme over binary hash codes for cross-modal retrieval. The proposed index scheme exploits a few binary bits of the hash code as the index code. An index structure is built by compiling reference data points with the same index codes into lists of an inverted table. Given a query, we estimate the relevance of each index code, which implicitly reflects the probability distribution of nearest neighbors (ground truth) for the query. The estimation is realized by a prediction model that learns a nonlinear mapping between the query of one modality and the index space of another modality through deep learning. We then traverse the index table from the top-ranked index codes with the highest relevance scores to retrieve high-quality candidates for further examination. We evaluate the proposed index scheme applied to three state-of-the-art cross-modal hashing algorithms on two widely-used benchmark datasets. Experimental results show the proposed method can effectively improve the search performance in terms of retrieval accuracy and computation time. The proposed index scheme can be built upon any binary code dataset generated by hashing algorithms to derive the following benefits:

- Based on the built index structure, the retrieval process achieves sublinear time complexity through inverted table lookup, whereas exhaustive search takes linear time.

- Given a query, the learned prediction model estimates the relevance scores of the index codes for a precise ranking, rather than ranking by inaccurate Hamming distances.
The remainder of this paper is organized as follows. In Section 2, we discuss previous work on cross-modal retrieval. Section 3 presents the proposed probability-based index scheme and search method. Section 4 shows experimental results. Concluding remarks are given in Section 5.
II. Related Work
The hashing technique can be classified into three main categories: unimodal hashing, multi-view hashing, and cross-modal hashing. Unimodal hashing derives binary hash codes from a single type of features; seminal work includes locality-sensitive hashing [1] and iterative quantization [2]. Multi-view hashing utilizes multiple types of features to learn better binary codes [3][4][5]. Cross-modal hashing (CMH) aims to facilitate information retrieval across different modalities. It usually embeds multiple heterogeneous data into a common latent space where discriminability or similarity correlation is preserved. Existing CMH algorithms can be further divided into unsupervised and supervised approaches. Unsupervised cross-modal hashing algorithms basically employ the data distribution to learn hash functions without label information. For example, composite correlation quantization (CCQ) [6] uses correlation-maximal mappings to transform data of different modality types into an isomorphic latent space. Unsupervised generative adversarial hashing [7] exploits generative adversarial networks to train a generative model and a discriminative model; a correlation graph is used to capture the underlying manifold structure across different modalities. Fusion similarity hashing (FSH) [8] constructs an undirected asymmetric graph to model the fusion similarity among different modalities and embeds this fusion similarity into a common Hamming space. On the other hand, supervised cross-modal hashing algorithms leverage label information to assist the learning process. For example, deep cross-modal hashing (DCMH) [9] learns hash functions for the corresponding modalities through deep neural networks (DNN); a cross-modal similarity matrix defined by class labels is employed to learn the hash functions, so the Hamming space preserves the characteristics of the similarity matrix. Semantics-preserving hashing (SePH) [10] transforms semantic affinities into a probability distribution and approximates it with hash codes by using kernel logistic regression. Discrete latent semantic hashing (DLSH) [11] learns the latent semantic representations of different modalities and then projects them into a shared Hamming space. Discrete latent factor hashing (DLFH) [12] utilizes discrete latent factors to model the supervised information and adopts a maximum likelihood loss function without relaxation. Deep discrete cross-modal hashing (DDCMH) [13] learns discrete nonlinear hash functions by preserving the intra-modality similarity at each hidden layer of the networks and the inter-modality similarity at the output layer of each individual network. Semi-supervised cross-modal hashing by generative adversarial network (SCH-GAN) [14] employs the generative model to select margin examples of one modality from unlabeled data for a query of another modality, while the discriminative model tries to distinguish the generated examples from true positive examples with respect to the query.

III. Hash Code Indexing
Figure 1 illustrates the proposed search framework with an example of retrieving text documents for an image query. It consists of a training part and a search part. In the training part, a CMH method is used to generate a reference dataset of binary codes of images and texts. An inverted index table is created to organize the reference dataset. Then a prediction model is trained to estimate the relevance of each index code of the index table. In the search part, the given image query is submitted to the prediction model to rank index codes based on their estimated relevance scores. Candidates are retrieved from the top-ranked index codes and reranked to output the nearest neighbors among text documents in response to the query. We elaborate on the two parts in the following.
III-A. Index Construction and Training
Suppose that we have a reference dataset of N binary codes of length c, denoted as B = {b_i}_{i=1}^N. The binary codes can be generated by any of the CMH algorithms. We select the first d bits of b_i as its index code x_i. An index table with 2^d entries is constructed from the index codes, where each entry represents a particular index code X and attaches the set of associated reference data points:

L(X) = { b_i ∈ B | x_i = X }.    (1)
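To make the construction concrete, the following minimal sketch (pure Python; names such as `build_index` are illustrative, not from the paper) compiles a toy reference set into inverted lists keyed by the first d bits of each code:

```python
from collections import defaultdict

def build_index(codes, d):
    """Group reference binary codes by their first d bits (the index code)."""
    table = defaultdict(list)
    for i, code in enumerate(codes):
        index_code = tuple(code[:d])  # the first d bits serve as the index code
        table[index_code].append(i)   # inverted list of reference ids
    return table

# Toy reference set: four 6-bit codes, indexed by their first 2 bits.
codes = [
    (0, 1, 1, 0, 1, 0),
    (0, 1, 0, 0, 0, 1),
    (1, 0, 1, 1, 0, 0),
    (0, 0, 1, 1, 1, 1),
]
table = build_index(codes, d=2)
# Entry (0, 1) now holds reference points 0 and 1.
```

Each entry corresponds to one index code X and stores the ids of the reference points whose first d bits equal X, in the spirit of Eq. (1).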
We train a prediction model that learns a nonlinear mapping between the query of one modality (e.g., texts) and the index space of another modality (e.g., images) through deep learning. The model is used to estimate the relevance scores of index codes for a given query. To compile the training dataset, we prepare a set of queries of one modality, denoted as Q = {q_j}_{j=1}^M, where q_j is the jth query. The relevant examples of the other modality for q_j are denoted as E_j = {e_{j,k}}, where e_{j,k} is the kth relevant example for q_j. Relevance is defined by the class label information; for example, the relevant examples of a text query are the images that share the query's class. The relevance score of each index code X for q_j is defined as the proportion of relevant examples in the entry:

r_j(X) = |E_j ∩ L(X)| / |L(X)|,    (2)

where |·| denotes the set cardinality. The training set is compiled as pairs of query features and relevance scores; the jth query q_j is associated with the set {r_j(X)} of relevance scores over all index codes.
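The relevance-score targets described above can be computed per query with a small sketch (illustrative names; the entry lists would come from the inverted index table):

```python
def relevance_scores(table, relevant_ids):
    """Target score of each index code: the fraction of the entry's
    reference points that are relevant to the given query."""
    relevant = set(relevant_ids)
    return {X: len(relevant.intersection(ids)) / len(ids)
            for X, ids in table.items()}

# Entry (0, 1) holds reference points 0 and 1; suppose points 1 and 3
# are the relevant examples for some query.
table = {(0, 1): [0, 1], (1, 0): [2], (0, 0): [3]}
scores = relevance_scores(table, relevant_ids=[1, 3])
# scores[(0, 1)] == 0.5, scores[(0, 0)] == 1.0, scores[(1, 0)] == 0.0
```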
A fully-connected neural network is employed to learn the relation between queries and index codes from the training set. The input layer receives the feature representation of q_j, and the output layer predicts the relevance scores of the 2^d index codes. Based on the cross-entropy loss between the predictions and the targets {r_j(X)}, we compute the error derivative with respect to the output of each neuron, which is backward-propagated through the layers to update the weights of the neural network.
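The backward pass above relies on a standard identity: with a softmax output layer and cross-entropy loss, the error derivative at each output logit is simply the prediction minus the target (assuming the target scores are normalized into a distribution, which the paper does not spell out). A minimal numeric sketch:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def output_error(logits, targets):
    """d(cross-entropy)/d(logit_i) = p_i - t_i for a softmax output layer."""
    p = softmax(logits)
    return [pi - ti for pi, ti in zip(p, targets)]

# When the prediction already matches the target, the gradient vanishes.
err = output_error([0.0, 0.0], [0.5, 0.5])  # -> [0.0, 0.0]
```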
III-B. NN Search
Given a query q for cross-modal retrieval, we utilize the trained network to predict the relevance scores of all index codes. The index codes are ranked to select the top R with the highest relevance scores, and the reference data points associated with these top-ranked index codes are retrieved into a candidate set C. We calculate the Hamming distance between the query and each candidate in C, then sort the distances in ascending order to return the desired number of NNs.
The time complexity of NN search mainly involves three parts: relevance score prediction, index code ranking, and candidate computation. The time spent on relevance score prediction depends on the size of the neural network and can be regarded as constant. Index code ranking sorts all 2^d index codes by their relevance scores, which takes O(2^d log 2^d) time. Candidate computation evaluates the Hamming distances between the query and all candidates, which takes s|C| time, where s is the tiny constant cost of one Hamming distance computation. The candidate set C is usually a small fraction of the reference dataset B, so the computation time is reduced significantly compared with exhaustive search. Moreover, the candidate set is of high quality and further boosts the search accuracy, as illustrated in the experimental section.
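The full search procedure can be sketched as follows (a simplified illustration: codes are packed into Python ints, and the predicted relevance scores are supplied directly rather than produced by the trained network):

```python
def hamming(a, b):
    """Hamming distance between two integer-packed binary codes (popcount)."""
    return bin(a ^ b).count("1")

def search(query_code, table, scores, top_r, codes, k):
    """Rank index codes by predicted relevance, gather candidates from the
    top-R entries, then rerank candidates by Hamming distance to the query."""
    ranked = sorted(table, key=lambda X: scores.get(X, 0.0), reverse=True)
    candidates = [i for X in ranked[:top_r] for i in table[X]]
    candidates.sort(key=lambda i: hamming(query_code, codes[i]))
    return candidates[:k]

# Toy setup: 6-bit codes indexed by their first 2 bits.
codes = [0b011010, 0b010001, 0b101100, 0b001111]
table = {0b01: [0, 1], 0b10: [2], 0b00: [3]}
scores = {0b01: 0.9, 0b00: 0.4, 0b10: 0.1}  # as if predicted by the network
nns = search(0b010011, table, scores, top_r=1, codes=codes, k=2)  # -> [1, 0]
```

Only the entries with the highest predicted relevance are touched, which is what keeps the fraction of accessed reference data small.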
IV. Experiment
To evaluate the proposed method, the experiment is conducted by using three state-of-the-art CMH algorithms on two widely-used benchmark datasets. The benchmark datasets are MIRFlickr [15] and NUS-WIDE [16], each of which consists of an image modality and a text modality. Table I summarizes the properties of the two benchmark datasets, which are then used to produce the CMH datasets. The original MIRFlickr dataset has 25000 instances collected from the Flickr website. Each instance consists of an image, associated textual tags, and one or more of 24 predefined semantic labels. We removed textual tags that appear fewer than 20 times in the dataset, and then deleted instances without any textual tag or semantic label. For each instance, the image view is characterized by a 150-D edge histogram, and the text view is represented as a 500-D feature vector derived from PCA on its binary tagging vector with respect to the textual tags. We took 5% of the MIRFlickr data to form the query set and the rest as the reference set. 10000 instances were sampled from the reference set for training. The ground-truth neighbors were defined as those image-text pairs that share at least one common label.
The original NUS-WIDE dataset has 260648 instances, each of which consists of an image and one or more of 81 predefined semantic labels. We selected the 195834 image-text pairs that belong to the 21 most frequent concepts. The text of each point is represented as a 1000-dimensional bag-of-words vector. The handcrafted feature of each image is a 500-dimensional bag-of-visual-words (BOVW) vector. We used 2000 data points as the query set and the remaining points as the reference set. 20000 data points were sampled from the reference set for training. The ground-truth neighbors were defined as those image-text pairs that share at least one common label, the same as in MIRFlickr.
The CMH algorithms, including SePH [10], DCMH [9], and CCQ [6], are employed to generate binary code datasets for MIRFlickr and NUS-WIDE. The program was implemented in Python and run on a PC with an Intel i7 CPU @ 3.6 GHz and 32 GB RAM.
Table I: Properties of the two benchmark datasets.

Dataset           | MIRFlickr | NUS-WIDE
Reference set     | 15902     | 193834
Training set      | 10000     | 20000
Query set         | 836       | 2000
Number of labels  | 24        | 21
IV-A. Implementation and Comparison
For each CMH algorithm, three index schemes are implemented for comparison:

- Exhaustive: applies exhaustive search that calculates the Hamming distances between the query and all reference data without any index structure.

- Naïve-index (d bits): takes the first d bits of the hash code as the index code of each reference data point. The query's index code is matched against the table to find candidates, which are then reranked by their Hamming distances.

- DNN-index (d bits): the proposed method. On top of the naïve index structure, we learn a 3-layer neural network to rank index codes. The network is configured as I1-H2-H3-O4, where I1 is the input layer, H2 and H3 are hidden layers with the same number of units as I1, and O4 is the output layer. ReLU and softmax are used as the activation functions of the hidden layers and the output layer, respectively.
Mean average precision (MAP) is used to evaluate the retrieval accuracy over a set of queries Q. The average precision of a query is

AP = (1/N_rel) Σ_{r=1}^{R} P(r) δ(r),    (3)

where R is the number of retrieved documents, N_rel is the number of relevant documents among them, P(r) denotes the precision of the top r retrieved documents, and δ(r) = 1 if the rth retrieved document is relevant to the query and δ(r) = 0 otherwise. Relevant documents are defined as those image-text pairs that share at least one common label with the query. MAP is computed as the mean of all the queries' average precision.
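The metric can be computed with a short sketch (this version normalizes AP by the number of relevant documents actually retrieved, one of several conventions in the literature; the paper's exact normalizer is not stated):

```python
def average_precision(retrieved, relevant, R):
    """AP@R: average of precision@r over the ranks r where a relevant
    document appears among the top R results."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for r, doc in enumerate(retrieved[:R], start=1):
        if doc in relevant:
            hits += 1
            total += hits / r  # precision of the top r results
    return total / hits if hits else 0.0

def mean_average_precision(all_retrieved, all_relevant, R):
    """MAP@R: mean of AP@R over all queries."""
    aps = [average_precision(ret, rel, R)
           for ret, rel in zip(all_retrieved, all_relevant)]
    return sum(aps) / len(aps)

# One query retrieving docs [1, 2, 3, 4] with relevant set {1, 3}:
# hits at ranks 1 and 3 give AP = (1/1 + 2/3) / 2 = 5/6.
```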
Figures 2 and 3 show the results for the search modality "text query vs. image dataset" on the MIRFlickr and NUS-WIDE datasets, respectively, and Figures 4 and 5 demonstrate the other search modality, "image query vs. text dataset". The X-axis and Y-axis represent the number of retrieved examples R and MAP@R, respectively. Except for the MIRFlickr-CCQ dataset, the exhaustive and naïve index schemes have similar MAP curves. The former did not benefit from reranking all reference data, since the Hamming distances are not accurate enough to reflect similarity to the query, while the latter reached comparable accuracy by taking only a few candidates for reranking. Moreover, the proposed DNN-guided index scheme effectively boosts the accuracy over both baseline schemes. We observe that the longer index code performed stably and yielded close MAP curves across the CMH algorithms; in addition, it generated a more compact candidate list. Tables II and III compare our method with these CMH algorithms under the two search modalities, respectively, in terms of MAP@50, the fraction of accessed reference data (ARD%), and runtime. ARD% is defined by:
ARD% = (|C| / |B|) × 100%.    (4)
A lower ARD% means a smaller computation cost due to fewer memory access operations on the reference data. The 14-bit DNN-index scheme, which obtained the highest accuracy and the smallest computation cost, showed a significant improvement when integrated with these CMH methods.
Table II: Comparison for the search modality "text query vs. image dataset".

                    | MIRFlickr 16-bit     | MIRFlickr 32-bit     | NUS-WIDE 16-bit      | NUS-WIDE 32-bit
Method              | MAP@50  ARD%   time  | MAP@50  ARD%   time  | MAP@50  ARD%   time  | MAP@50  ARD%   time
SePH                | 0.7137  100%   1.40  | 0.7493  100%   2.07  | 0.5303  100%   13.33 | 0.5992  100%   20.83
DCMH                | 0.7451  100%   1.49  | 0.7660  100%   2.19  | 0.5777  100%   13.20 | 0.5961  100%   19.80
CCQ                 | 0.4842  100%   1.70  | 0.4539  100%   3.04  | 0.1666  100%   13.85 | 0.1565  100%   22.35
DNN-index (14 bits) | 0.8753  0.33%  1.04  | 0.8754  0.33%  1.04  | 0.7564  0.23%  1.27  | 0.7583  0.23%  1.27
Table III: Comparison for the search modality "image query vs. text dataset".

                    | MIRFlickr 16-bit     | MIRFlickr 32-bit     | NUS-WIDE 16-bit      | NUS-WIDE 32-bit
Method              | MAP@50  ARD%   time  | MAP@50  ARD%   time  | MAP@50  ARD%   time  | MAP@50  ARD%   time
SePH                | 0.5992  100%   1.39  | 0.6179  100%   2.10  | 0.3747  100%   13.29 | 0.4037  100%   20.54
DCMH                | 0.6899  100%   1.54  | 0.7075  100%   2.44  | 0.4823  100%   14.97 | 0.6005  100%   19.34
CCQ                 | 0.4011  100%   1.77  | 0.3996  100%   2.94  | 0.1601  100%   13.92 | 0.1530  100%   22.97
DNN-index (14 bits) | 0.8803  0.32%  0.93  | 0.8803  0.32%  0.93  | 0.6775  0.03%  1.02  | 0.6775  0.03%  1.02
V. Conclusion
In this paper, we propose a novel search method that utilizes a probability-based index scheme over binary hash codes for cross-modal retrieval. The index scheme, which ranks the hash index codes of the inverted table through a DNN, effectively increases the search accuracy and decreases the computation cost. Extensive experimental results show the superiority of the proposed method over the baselines.
Acknowledgement
This work was supported by the Ministry of Science and Technology, Taiwan, under grant MOST 106-2221-E-415-019-MY3.
References

[1] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the ACM Symposium on Theory of Computing, pp. 604-613, 1998.
[2] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, "Iterative quantization: A Procrustean approach to learning binary codes for large-scale image retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2916-2929, 2013.
[3] D. Zhang, F. Wang, and L. Si, "Composite hashing with multiple information sources," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 225-234, 2011.
[4] J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo, "Effective multiple feature hashing for large-scale near-duplicate video retrieval," IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1997-2008, 2013.
[5] L. Liu, M. Yu, and L. Shao, "Multiview alignment hashing for efficient image search," IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 956-966, 2015.
[6] M. Long, Y. Cao, J. Wang, and P. S. Yu, "Composite correlation quantization for efficient multimodal retrieval," in Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 579-588, 2016.
[7] J. Zhang, Y. Peng, and M. Yuan, "Unsupervised generative adversarial cross-modal hashing," in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[8] H. Liu, R. Ji, Y. Wu, F. Huang, and B. Zhang, "Cross-modality binary code learning via fusion similarity hashing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7380-7388, 2017.
[9] Q. Y. Jiang and W. J. Li, "Deep cross-modal hashing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3232-3240, 2017.
[10] Z. Lin, G. Ding, J. Han, and J. Wang, "Cross-view retrieval via probability-based semantics-preserving hashing," IEEE Transactions on Cybernetics, vol. 47, no. 12, pp. 4342-4355, 2017.
[11] X. Lu, L. Zhu, Z. Cheng, X. Song, and H. Zhang, "Efficient discrete latent semantic hashing for scalable cross-modal retrieval," Signal Processing, vol. 154, pp. 217-231, 2019.
[12] Q. Y. Jiang and W. J. Li, "Discrete latent factor model for cross-modal hashing," IEEE Transactions on Image Processing, 2019, in press.
[13] F. Zhong, Z. Chen, and G. Min, "Deep discrete cross-modal hashing for cross-media retrieval," Pattern Recognition, vol. 83, pp. 64-77, 2018.
[14] J. Zhang, Y. Peng, and M. Yuan, "SCH-GAN: Semi-supervised cross-modal hashing by generative adversarial network," IEEE Transactions on Cybernetics, 2018, in press.
[15] M. J. Huiskes and M. S. Lew, "The MIR Flickr retrieval evaluation," in Proceedings of the ACM International Conference on Multimedia Information Retrieval, pp. 39-43, 2008.
[16] T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, "NUS-WIDE: a real-world web image database from National University of Singapore," in Proceedings of the ACM International Conference on Image and Video Retrieval, p. 48, 2009.