Nearest neighbor (NN) search plays a fundamental role in machine learning and information retrieval. Cross-modal retrieval, an application based on nearest neighbor search, has grabbed much research attention recently. It is natural that multimedia data have multiple modalities; these modalities may contribute correlated semantic information, such as video-tag pairs in YouTube and image-text pairs in Flickr. Cross-modal retrieval can return relevant results of one modality for a given query of another modality. For example, we can use text queries to retrieve images, and use image queries to retrieve texts. This retrieval paradigm provides a useful interface for users to search data across different modalities.
With the rapid growth of multimedia data, exhaustive search over a large-scale dataset is impractical because it consumes tremendous computational resources. To address this issue, existing cross-modal retrieval methods mainly leverage hashing to generate compact data representations. The goal of hashing is to embed the data points from the original space into a Hamming space as binary hash codes. It generally exploits inter-/intra-class correlations or the underlying data distribution/manifold to learn a set of hash functions, so that similar binary codes are generated for similar data points. Computing Hamming distances between binary codes enables fast nearest neighbor search through hardware-supported bit operations with low memory consumption. However, hashing faces the critical problem of quantization loss after binary embedding. Even though hashing algorithms have proposed various learning strategies to reduce this loss, an inevitable and large information gap remains between a real-valued vector and the corresponding binary code. Searching for nearest neighbors in the binary Hamming space is therefore less accurate than searching in the real-valued Euclidean space.
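As a minimal illustration of the bit-level distance computation mentioned above, the following Python sketch packs binary codes into integers and counts the differing bits with XOR. The helper names `pack_bits` and `hamming` are illustrative assumptions and do not come from any hashing library.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one integer (first bit is most significant)."""
    value = 0
    for b in bits:
        value = (value << 1) | (b & 1)
    return value


def hamming(a, b):
    """Hamming distance between two packed codes: XOR, then count the set bits."""
    return bin(a ^ b).count("1")


x = pack_bits([1, 0, 1, 1, 0, 0, 1, 0])
y = pack_bits([1, 1, 1, 0, 0, 0, 1, 1])
print(hamming(x, y))  # the two codes differ at positions 1, 3 and 7
```

Storing each code as a machine word is what makes the distance a single XOR followed by a population count.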
In this paper, we propose a novel index scheme over binary hash codes for cross-modal retrieval. The proposed index scheme uses a few bits of the hash code as the index code. An index structure is built by compiling reference data points with the same index code into the lists of an inverted table. Given a query, we estimate the relevance of each index code, which implicitly reflects the probability distribution of the nearest neighbors (ground truth) of the query. The estimation is realized by a prediction model that learns a nonlinear mapping between the query of one modality and the index space of another modality through deep learning. We then traverse the index table starting from the top-ranked index codes, i.e., those with the highest relevance scores, to retrieve high-quality candidates for further examination. We evaluate the proposed index scheme on three state-of-the-art cross-modal hashing algorithms using two widely used benchmark datasets. Experimental results show that the proposed method effectively improves search performance in terms of both retrieval accuracy and computation time. The proposed index scheme can be built upon any binary code dataset generated by a hashing algorithm and provides the following benefits:
Based on the built index structure, the retrieval process can achieve sub-linear time complexity through inverted table lookup, compared with the exhaustive search that takes linear time complexity.
Given a query, the learned prediction model is employed to estimate the relevance scores of the index codes for a precise ranking, rather than ranking by inaccurate Hamming distances.
The remainder of this paper is organized as follows. Section 2 discusses previous work on cross-modal retrieval. Section 3 presents the proposed probability-based index scheme and search method. Section 4 shows experimental results. Concluding remarks are given in Section 5.
II Related Work
The hashing technique can be classified into three main categories: uni-modal hashing, multi-view hashing, and cross-modal hashing. Uni-modal hashing derives binary hash codes from a single type of features. Seminal work includes locality-sensitive hashing [1] and iterative quantization [2]. Multi-view hashing utilizes multiple types of features to learn better binary codes [3]-[5]. Cross-modal hashing (CMH) aims to facilitate information retrieval across different modalities. It usually embeds multiple heterogeneous data into a common latent space where the discriminability or similarity correlation is preserved.
Existing CMH algorithms can be further divided into unsupervised and supervised approaches. Unsupervised cross-modal hashing algorithms employ the data distribution to learn hash functions without label information. For example, composite correlation quantization (CCQ) [6] uses correlation-maximal mappings to transform data from different modality types into an isomorphic latent space. Unsupervised generative adversarial hashing [7] exploits generative adversarial networks to train a generative model and a discriminative model; a correlation graph is used to capture the underlying manifold structure across different modalities. Fusion similarity hashing (FSH) [8] constructs an undirected asymmetric graph to model the fusion similarity among different modalities and embeds this fusion similarity into a common Hamming space.

Supervised cross-modal hashing algorithms, on the other hand, leverage label information to assist the learning process. For example, deep cross-modal hashing (DCMH) [9] learns hash functions for the corresponding modalities through deep neural networks (DNN). A cross-modal similarity matrix defined by class labels is employed to learn the hash functions, so that the Hamming space preserves the characteristics of the similarity matrix. Semantics-preserving hashing (SePH) [10] transforms semantic affinities into a probability distribution and approximates it with hash codes by using kernel logistic regression. Discrete latent semantic hashing (DLSH) [11] learns the latent semantic representations of different modalities and then projects them into the shared Hamming space. Discrete latent factor model (DLFH) [12] utilizes the discrete latent factor to model the supervised information and adopts the maximum likelihood loss function without relaxation. Deep discrete cross-modal hashing (DDCMH) [13] learns discrete nonlinear hash functions by preserving the intra-modality similarity at each hidden layer of the networks and the inter-modality similarity at the output layer of each individual network. Semi-supervised cross-modal hashing by generative adversarial network (SCH-GAN) [14] employs the generative model to select margin examples of one modality from unlabeled data for a query of another modality, while the discriminative model tries to distinguish the generated examples from the true positive examples with respect to the query.
III Hash Code Indexing
Figure 1 illustrates the proposed search framework with an example of retrieving text documents for an image query. It consists of the training part and the search part. In the training part, a CMH method is used to generate a reference dataset of binary codes of images and texts. An inverted index table is created to organize the reference dataset. Then a prediction model is trained to estimate the relevance of each index code of the index table. In the search part, the given image query is submitted to the prediction model to rank index codes based on their estimated relevance scores. Candidates are retrieved from the top-rank index codes and reranked to output nearest neighbors of text documents in response to the query. We elaborate the two parts in the following.
III-A Index Construction and Training
Suppose that we have a reference dataset of N binary codes of length c, denoted as $B = \{b_i\}_{i=1}^{N}$ with $b_i \in \{0,1\}^c$. The binary codes can be generated by any one of the CMH algorithms. We select the first d binary bits from $b_i$ as the index code $x_i \in \{0,1\}^d$. An index table $T$ with $2^d$ entries is constructed based on the index codes, where each entry represents a particular index code X and attaches a set of associated reference data points:

$$T(X) = \{\, b_i \in B \mid x_i = X \,\}.$$
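The index construction described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: codes are stored as tuples of 0/1, the function names `index_code` and `build_index_table` are hypothetical, and only non-empty entries of the table are materialised.

```python
from collections import defaultdict


def index_code(code, d):
    """Return the first d bits of a binary code (tuple of 0/1) as the index code."""
    return tuple(code[:d])


def build_index_table(codes, d):
    """Map each index code to the ids of the reference points that share it."""
    table = defaultdict(list)
    for i, code in enumerate(codes):
        table[index_code(code, d)].append(i)
    return dict(table)


# Toy reference set: N = 5 codes of length c = 4, indexed by d = 2 bits.
reference = [
    (0, 0, 1, 0),
    (0, 1, 1, 1),
    (0, 0, 0, 1),
    (1, 1, 0, 0),
    (0, 1, 0, 0),
]
table = build_index_table(reference, d=2)
print(table[(0, 0)])  # ids of the points whose first two bits are 00
```

Storing point ids rather than the codes themselves keeps each inverted list small; the full codes are looked up only when candidates are reranked.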
We train a prediction model that learns a nonlinear mapping between the query of one modality (e.g., texts) and the index space of another modality (e.g., images) through deep learning. The model is used to estimate the relevance scores of index codes for a given query. To compile the training dataset, we prepare a set of queries of one modality, denoted as $\{q_j\}$, where $q_j$ is the jth query. The relevant examples of another modality for $q_j$ are denoted as $E_j = \{e_{jk}\}$, where $e_{jk}$ is the kth relevant example for $q_j$. The definition of a relevant example is based on the class label information. For example, the relevant examples of a text query are the images whose class is the same as that of the query. The relevance score of each index code X for $q_j$ is defined as the proportion of relevant examples in the corresponding entry of the index table, written here as $T(X)$, to the entry size:

$$r_j(X) = \frac{|T(X) \cap E_j|}{|T(X)|},$$

where $|\cdot|$ denotes the set cardinality. The training set is compiled as pairs of query features and relevance scores; the jth query $q_j$ is associated with the set of relevance scores of the index codes, $\{r_j(X)\}$.
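The relevance score described above, i.e., the proportion of relevant examples in an entry relative to the entry size, can be sketched as follows. The function name `relevance_scores`, the toy table, and the ground-truth ids are illustrative assumptions; empty entries are simply absent from the table in this sketch.

```python
def relevance_scores(table, relevant_ids):
    """Per index code, the fraction of its attached points that are relevant."""
    relevant = set(relevant_ids)
    return {
        code: sum(1 for i in members if i in relevant) / len(members)
        for code, members in table.items()
    }


# Toy index table (index code -> ids of attached reference points).
table = {(0, 0): [0, 2], (0, 1): [1, 4], (1, 1): [3]}
# Hypothetical ground truth: points 0, 2 and 4 share a class label with the query.
scores = relevance_scores(table, relevant_ids=[0, 2, 4])
print(scores)
```

One such score vector per training query forms the regression target for the prediction model.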
A fully-connected neural network is employed to learn the relation between the query and the index codes based on the training set. The input layer receives the feature representation of the query $q_j$, and the output layer predicts the relevance scores $\hat{r}_j(X)$ of the index codes. Based on the cross-entropy loss between the predictions and the targets $r_j(X)$, we compute the error derivative with respect to the output of each neuron, which is propagated backward through the layers to update the weights of the neural network.
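To make the loss and its error derivative concrete, the sketch below computes a softmax cross-entropy and its gradient with respect to the pre-softmax outputs in plain Python. It assumes the targets are used as given, without renormalisation, which the text does not specify. The function names and toy numbers are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(pred, target):
    """Cross-entropy between predicted probabilities and target scores."""
    return -sum(t * math.log(p) for p, t in zip(pred, target) if t > 0)


def grad_wrt_logits(pred, target):
    """Derivative of the loss w.r.t. the pre-softmax outputs.

    For softmax followed by cross-entropy this is p_k * sum(t) - t_k,
    which reduces to p_k - t_k when the targets sum to one.
    """
    s = sum(target)
    return [p * s - t for p, t in zip(pred, target)]


logits = [2.0, 0.5, -1.0]   # raw network outputs for three index codes
target = [1.0, 0.5, 0.0]    # toy relevance scores (assumed, not from the paper)
pred = softmax(logits)
print(cross_entropy(pred, target))
print(grad_wrt_logits(pred, target))
```

This output-layer gradient is the quantity that backpropagation then pushes through the hidden layers.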
III-B NN Search
Given a query q for cross-modal retrieval, we utilize the trained network to predict the relevance scores of the index codes, $\hat{r}(X)$. The index codes are ranked to select the top-R index codes with the highest relevance scores, and the reference data points associated with these top-ranking index codes are retrieved into a candidate set $C$. We calculate the Hamming distance between the query and each candidate in C, then sort the candidates by distance in ascending order to return the desired number of NNs.
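The search procedure above can be sketched as follows. The predicted scores are a hard-coded stand-in for the network output, `query_code` is assumed to be the hash code that the CMH method produces for the query, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def search(query_code, scores, table, codes, top_r, k):
    """Rank index codes by predicted score, gather candidates, rerank by Hamming distance."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_r]
    candidates = [i for code in ranked for i in table.get(code, [])]
    candidates.sort(key=lambda i: hamming(query_code, codes[i]))
    return candidates[:k]


codes = [(0, 0, 1, 0), (0, 1, 1, 1), (0, 0, 0, 1), (1, 1, 0, 0), (0, 1, 0, 0)]
table = {(0, 0): [0, 2], (0, 1): [1, 4], (1, 1): [3]}
# Stand-in for the network output: one predicted relevance score per index code.
predicted = {(0, 0): 0.7, (0, 1): 0.2, (1, 1): 0.1}
print(search((0, 0, 1, 1), predicted, table, codes, top_r=2, k=3))
```

Only the points attached to the selected index codes are touched, so point 3 is never compared with the query in this toy run.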
The time complexity of NN search mainly involves three parts, namely, relevance score prediction, index code ranking, and candidate computation. The time spent on relevance score prediction depends on the size of the neural network and is regarded as constant. Index code ranking requires sorting all $2^d$ index codes by their relevance scores; it takes $O(2^d \log 2^d)$ computation time. Candidate computation computes the Hamming distances to the query for all candidates; it takes $O(s\,|C|)$ time, where s is a tiny constant time for computing one Hamming distance. The candidate set C is usually a small fraction of the reference dataset B, so the computation time is reduced significantly compared with exhaustive search. Moreover, the candidate set is of high enough quality to further boost the search accuracy, as shown in the experimental section.
To evaluate the proposed method, experiments are conducted using three state-of-the-art CMH algorithms on two widely used benchmark datasets. The benchmark datasets are MIRFlickr [15] and NUS-WIDE [16], each of which consists of an image modality and a text modality. Table I summarizes the properties of the two benchmark datasets, which are then used to produce the CMH datasets. The original MIRFlickr dataset has 25000 instances collected from the Flickr website. Each instance consists of an image, associated textual tags, and one or more of 24 predefined semantic labels. We removed textual tags that appear fewer than 20 times in the dataset, and then deleted instances without any textual tag or semantic label. For each instance, its image view is characterized by a 150-D edge histogram, and its text view is represented as a 500-D feature vector derived from PCA on its binary tagging vector with respect to the textual tags. We took 5% of the MIRFlickr data to form the query set and the rest as the reference set. 10000 instances were sampled from the reference set for training. The ground-truth neighbors were defined as those image-text pairs that share at least one common label.
The original NUS-WIDE dataset has 260648 instances, each of which consists of an image and one or more of 81 predefined semantic labels. We selected 195834 image-text pairs that belong to the 21 most frequent concepts. The text of each data point is represented as a 1000-dimensional bag-of-words vector. The hand-crafted feature for each image is a 500-dimensional bag-of-visual-words (BOVW) vector. We used 2000 data points as the query set and the remaining points as the reference set. 20000 data points were sampled from the reference set for training. The ground-truth neighbors were defined as those image-text pairs that share at least one common label, the same as for MIRFlickr.
The CMH algorithms SePH [10], DCMH [9], and CCQ [6] are employed to generate binary code datasets for MIRFlickr and NUS-WIDE. The program was implemented in Python and run on a PC with an Intel i7 CPU @ 3.6 GHz and 32 GB RAM.
|Property||MIRFlickr||NUS-WIDE|
|Number of Labels||24||21|
IV-A Implementation and Comparison
For each CMH algorithm, three kinds of index schemes are implemented for comparison:
Exhaustive. It applies the exhaustive search that calculates Hamming distances between the query and all reference data without adopting any index structure.
Naïve-index (d bits). It takes the first d bits of the hash code as the index code for each reference data point. The index code of the given query is compared against the index table to find candidates, which are then reranked according to their Hamming distances. Here .
DNN-index (d bits). It is the proposed method. In addition to the naïve index structure, we learn a 3-layer neural network to rank index codes. The network is configured as I1-H2-H3-O4, where I1 is the input layer, H2 and H3 are hidden layers with the same number of units as I1, and O4 is the output layer. ReLU and softmax are used as the activation functions for the hidden layers and the output layer, respectively. Here.
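A forward pass through an I1-H2-H3-O4 network of the kind described for the DNN-index can be sketched in plain Python as below. The layer sizes, the random initialisation, and the choice of one output unit per index code (so $2^d$ outputs) are assumptions made for illustration; the sketch is untrained and only shows the data flow.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (initialisation scheme is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """I1 -> H2 (ReLU) -> H3 (ReLU) -> O4 (softmax)."""
    (w2, b2), (w3, b3), (w4, b4) = layers
    h2 = relu(dense(x, w2, b2))
    h3 = relu(dense(h2, w3, b3))
    return softmax(dense(h3, w4, b4))


rng = random.Random(0)
n_in, d = 8, 3  # toy sizes: 8-D query feature, 2**3 index codes
layers = [init_layer(n_in, n_in, rng), init_layer(n_in, n_in, rng),
          init_layer(n_in, 2 ** d, rng)]
scores = forward([rng.random() for _ in range(n_in)], layers)
print(len(scores), round(sum(scores), 6))
```

Because the output layer is a softmax, the predicted scores are positive and sum to one, so they can be sorted directly to rank the index codes.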
Mean average precision (MAP) is used to evaluate the retrieval accuracy for a set of queries Q:

$$\mathrm{AP}(q) = \frac{\sum_{r=1}^{R} P(r)\,\delta(r)}{\sum_{r=1}^{R} \delta(r)}, \qquad \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q),$$

where $R$ is the number of retrieved documents, $P(r)$ denotes the precision of the top $r$ retrieved documents, and $\delta(r) = 1$ if the $r$th retrieved document is relevant to the query, otherwise $\delta(r) = 0$. The relevant documents are defined as those image-text pairs that share at least one common label. MAP is computed as the mean of all the queries' average precision.
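The MAP computation can be sketched as follows. The sketch assumes that average precision is normalised by the number of relevant documents among the retrieved ones, which is a common convention but is not confirmed by the text; the function names and the toy relevance lists are illustrative.

```python
def average_precision(relevance):
    """AP over a ranked 0/1 relevance list, normalised by the number of relevant hits."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision of the top-`rank` results
    return total / hits if hits else 0.0


def mean_average_precision(all_relevance):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r) for r in all_relevance) / len(all_relevance)


# Two toy queries with R = 4 retrieved documents each (1 = relevant).
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 1]]))
```

A query with no relevant document among the retrieved ones contributes an average precision of zero in this sketch.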
Figures 2 and 3 show the results for the search modality “text query vs. image dataset” on the MIRFlickr and NUS-WIDE datasets, respectively, and Figures 4 and 5 show the other search modality, “image query vs. text dataset”. The X-axis and Y-axis represent the number of retrieved examples R and MAP@R, respectively. Except for the MIRFlickr-CCQ dataset, the exhaustive and naïve index schemes have similar MAP curves. The former scheme did not benefit from reranking all reference data because the Hamming distances are not accurate enough to reflect the similarities to the query, whereas the latter scheme reaches comparable accuracy by taking only a few candidates for reranking. Moreover, the proposed DNN-guided index scheme effectively boosts the accuracy compared with these two baseline schemes. We observe that the longer index code performed stably and yielded close MAP curves across the CMH algorithms. In addition, the longer index code generated a more compact candidate list. Tables II and III compare our method with these CMH algorithms for the text-to-image and image-to-text search modalities, respectively, in terms of MAP@50, the fraction of accessed reference data (ARD%), and runtime. ARD% is defined by:

$$\mathrm{ARD\%} = \frac{|C|}{|B|} \times 100\%,$$

where C is the candidate set accessed for a query and B is the reference dataset.
A lower ARD% means a smaller computation cost due to fewer memory access operations on the reference data. The 14-bit DNN-index scheme, which obtained the highest accuracy and the smallest computation cost, showed a significant improvement when integrated with these CMH methods.
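As a small worked example, and assuming ARD% is the number of accessed reference points divided by the size of the reference set, the measure can be computed as below. The function name and the numbers are illustrative and are not taken from the tables.

```python
def ard_percent(num_accessed, num_reference):
    """Fraction of accessed reference data, expressed as a percentage."""
    return 100.0 * num_accessed / num_reference


# Toy numbers: 330 candidates examined out of 100000 reference points.
print(ard_percent(330, 100_000))
```

Exhaustive search corresponds to an ARD% of 100, since every reference point is compared with the query.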
|DNN-index (14 bits)||0.8753||0.33%||1.04||0.8754||0.33%||1.04||0.7564||0.23%||1.27||0.7583||0.23%||1.27|
|DNN-index (14 bits)||0.8803||0.32%||0.93||0.8803||0.32%||0.93||0.6775||0.03%||1.02||0.6775||0.03%||1.02|
In this paper, we propose a novel search method that utilizes a probability-based index scheme over binary hash codes for cross-modal retrieval. The index scheme, which ranks the index codes of the inverted table through a DNN, can effectively increase the search accuracy and decrease the computation cost. Experimental results on two benchmark datasets show that the proposed method outperforms the exhaustive and naïve-index baselines in both respects.
This work was supported by the Ministry of Science and Technology, Taiwan, under grant MOST 106-2221-E-415-019-MY3.
- [1] P. Indyk and R. Motwani, “Approximate nearest neighbors: towards removing the curse of dimensionality,” in Proceedings of the ACM Symposium on Theory of Computing, pp. 604-613, 1998.
- [2] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2916-2929, 2013.
- [3] D. Zhang, F. Wang, and L. Si, “Composite hashing with multiple information sources,” in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 225-234, 2011.
- [4] J. Song, Y. Yang, Z. Huang, H. T. Shen, and J. Luo, “Effective multiple feature hashing for large-scale near-duplicate video retrieval,” IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1997-2008, 2013.
- [5] L. Liu, M. Yu, and L. Shao, “Multiview alignment hashing for efficient image search,” IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 956-966, 2015.
- [6] M. Long, Y. Cao, J. Wang, and P. S. Yu, “Composite correlation quantization for efficient multimodal retrieval,” in Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 579-588, 2016.
- [7] J. Zhang, Y. Peng, and M. Yuan, “Unsupervised generative adversarial cross-modal hashing,” in the AAAI Conference on Artificial Intelligence, 2018.
- [8] H. Liu, R. Ji, Y. Wu, F. Huang, and B. Zhang, “Cross-modality binary code learning via fusion similarity hashing,” in , pp. 7380-7388, 2017.
- [9] Q. Y. Jiang and W. J. Li, “Deep cross-modal hashing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3232-3240, 2017.
- [10] Z. Lin, G. Ding, J. Han, and J. Wang, “Cross-view retrieval via probability-based semantics-preserving hashing,” IEEE Transactions on Cybernetics, vol. 47, no. 12, pp. 4342-4355, 2017.
- [11] X. Lu, L. Zhu, Z. Cheng, X. Song, and H. Zhang, “Efficient discrete latent semantic hashing for scalable cross-modal retrieval,” Signal Processing, vol. 154, pp. 217-231, 2019.
- [12] Q. Y. Jiang and W. J. Li, “Discrete latent factor model for cross-modal hashing,” IEEE Transactions on Image Processing, 2019, in press.
- [13] F. Zhong, Z. Chen, and G. Min, “Deep discrete cross-modal hashing for cross-media retrieval,” Pattern Recognition, vol. 83, pp. 64-77, 2018.
- [14] J. Zhang, Y. Peng, and M. Yuan, “SCH-GAN: Semi-supervised cross-modal hashing by generative adversarial network,” IEEE Transactions on Cybernetics, 2018, in press.
- [15] M. J. Huiskes and M. S. Lew, “The MIR Flickr retrieval evaluation,” in Proceedings of the ACM International Conference on Multimedia Information Retrieval, pp. 39-43, 2008.
- [16] T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: a real-world web image database from National University of Singapore,” in Proceedings of the ACM International Conference on Image and Video Retrieval, p. 48, 2009.