1 Introduction
With a tremendous amount of multimedia data, such as texts and images, being generated on the Internet every day, similarity-preserving hashing methods [1, 2, 3, 4, 5, 6, 7, 8, 9] have been extensively studied for large-scale multimedia search due to their high retrieval efficiency and low storage cost. Because corresponding data of different modalities may have semantic correlations, it is essential to support cross-modal retrieval, which returns relevant results of one modality when querying with another modality, e.g., retrieving images with text queries. Hence, cross-modal hashing methods [10, 11, 12, 13, 14, 15] are receiving more and more attention.
Roughly speaking, cross-modal hashing methods can be divided into shallow cross-modal hashing methods [16, 2, 17, 10, 18, 19] and deep cross-modal hashing methods [20, 21, 14, 22, 11, 23]. Shallow cross-modal hashing methods mainly use hand-crafted features to learn projections that map each example into a binary code. The feature extraction procedure in these methods is independent of the hash-code learning procedure, so shallow cross-modal hashing methods may not achieve satisfactory performance in real applications, because the hand-crafted features might not be optimally suitable for the hash-code optimization procedure. Compared with shallow cross-modal hashing methods, deep cross-modal hashing methods can integrate feature learning and hash-code learning into the same framework and capture nonlinear correlations among cross-modal instances more effectively, yielding better performance. Here, each instance contains two correlated data points of different modalities, such as an image-text pair.
However, the existing deep cross-modal hashing methods either cannot learn a unified hash code for the two correlated data points of different modalities in a database instance, or cannot use the feedback of the hashing-function learning procedure to guide the learning of unified hash codes and thereby enhance retrieval accuracy. First, most deep cross-modal hashing methods assume that the two correlated data points of different modalities in a database instance have different hash codes, and then try to decrease the gap between the two hash codes by optimizing certain predefined loss functions. Thus, they only learn similar hash codes for the two correlated data points of the same instance and cannot obtain unified hash codes, even though the unified hash code scheme has been shown to enhance retrieval accuracy [24, 25, 26]. Second, as far as we know, there is so far only one deep cross-modal hashing method that can learn unified hash codes [11]. That method is a two-step framework: it first learns unified hash codes for instances in a database, and then utilizes the learned unified hash codes to learn modality-specific hashing functions. Consequently, this deep hashing method cannot use the feedback of the hashing-function learning procedure to guide the learning of unified hash codes.

To address the issues above, in this paper we propose a novel method, Deep Cross-Modal Hashing with Hashing Functions and Unified Hash Codes Jointly Learning, called DCHUC. DCHUC can jointly learn unified hash codes for database instances and modality-specific hashing functions for unseen query points in an end-to-end framework. More specifically, by minimizing the objective function, DCHUC uses a four-step iterative scheme to optimize the unified hash codes of the database instances and the hash codes of query data points generated by the learned hashing networks. With this iterative optimization algorithm, the learned unified hash codes can guide the hashing-function learning procedure; meanwhile, the learned hashing functions can feed back to guide the unified hash-code optimization procedure. Moreover, the objective function consists of a hashing loss and a classification loss: the hashing loss makes the learned hash codes preserve both inter-modal and intra-modal similarity, and the classification loss makes the learned hash codes preserve more discriminative semantic information.
In addition, because the training phase of deep models is typically time-consuming, it is hard to use all instances in a large-scale database to train a hashing model. Inspired by ADSH [27], we use an asymmetric scheme to reduce the training time complexity to O(mn), where n is the database size and m is the number of sampled anchors. Specifically, we sample m anchor instances from the database instances (m ≪ n) to approximate the query set, and construct an asymmetric affinity to supervise both hashing-function learning for unseen query instances and unified hash-code optimization for database instances.
To summarize, the main contributions of DCHUC are outlined as follows:
- To the best of our knowledge, DCHUC is the first deep method that can jointly learn unified hash codes for database instances and hashing functions for unseen query points in an end-to-end framework. With this end-to-end framework, our method obtains high-quality hash codes that improve retrieval accuracy.
- By treating query instances and database instances asymmetrically, DCHUC can use the whole set of database instances in the training phase to generate higher-quality hash codes, even when the database is large.
- Experiments on three large-scale datasets show that DCHUC outperforms state-of-the-art cross-modal hashing baselines in real applications.
2 Related Work
In this section, we briefly review related work on cross-modal hashing, including shallow cross-modal hashing methods and deep cross-modal hashing methods.
2.1 Shallow Cross-Modal Hashing Methods
Shallow cross-modal hashing methods [4, 28, 29, 25, 18, 19] mainly use hand-crafted features to learn a single pair of linear or nonlinear projections that map each example into a binary vector. Representative methods in this category include Cross-Modality Similarity-Sensitive Hashing (CMSSH) [4], Semantic Correlation Maximization (SCM) [28], Cross-View Hashing (CVH) [29], Latent Semantic Sparse Hashing (LSSH) [25], Collective Matrix Factorization Hashing (CMFH) [26], Semantics-Preserving Hashing (SePH) [30], Supervised Discrete Manifold-embedded Cross-Modal Hashing (SDMCH) [18], Discrete Latent Factor Hashing (DLFH) [19] and Discrete Cross-modal Hashing (DCH) [31]. CMSSH is a supervised hashing method that preserves intra-class similarity via eigendecomposition and boosting. SCM utilizes label information to learn a modality-specific transformation and preserves the maximal correlation between modalities. CVH is an unsupervised cross-modal spectral hashing method in which the cross-modality similarity is also preserved by the learned hash functions. LSSH utilizes sparse coding and matrix factorization in a common space to obtain unified binary codes through a latent space learning method. CMFH learns a unified binary hash code by performing matrix factorization with a latent factor model in the training stage. SePH generates a unified binary hash code by constructing an affinity matrix as a probability distribution while minimizing the Kullback-Leibler divergence. SDMCH generates binary hash codes by exploiting the nonlinear manifold structure of the data and constructing correlations among heterogeneous modalities with semantic information. DLFH directly learns the binary hash codes without continuous relaxation through a discrete latent factor model. DCH jointly learns the unified binary codes and the modality-specific hash functions under a classification framework with a discrete optimization algorithm.
Despite significant progress in this category, the performance of hand-crafted-feature-based methods is still unsatisfactory in many real-world applications: because the feature extraction procedure is independent of the hash-code learning procedure, the hand-crafted features might not be optimally suitable for the hash-code optimization procedure.
2.2 Deep Cross-Modal Hashing Methods
Recently, deep cross-modal hashing methods [21, 32, 20, 14, 22, 11] have achieved promising performance thanks to the powerful nonlinear representation ability of deep neural networks. For example, Deep Visual-Semantic Hashing (DVSH) [21] learns a visual-semantic fusion network with a cosine hinge loss to generate binary codes and learns modality-specific deep networks to obtain hashing functions. However, DVSH can only be used in special cross-modal cases where one of the modalities has temporal dynamics. Deep Cross-Modal Hashing (DCMH) [32] utilizes a negative log-likelihood loss to generate cross-modal similarity-preserving hash codes in an end-to-end deep learning framework. Correlation Autoencoder Hashing (CAH) [20] learns hashing functions through an autoencoder architecture that jointly maximizes the feature and semantic correlation between different modalities. Adversarial Cross-Modal Retrieval (ACMR) [14] utilizes a classification framework with adversarial learning to discriminate between different modalities and generate binary hash codes. Self-Supervised Adversarial Hashing (SSAH) [22] generates binary hash codes by utilizing two adversarial networks to jointly model different modalities and capture their semantic relevance under the supervision of learned semantic features. Cross-Modal Deep Variational Hashing (CMDVH) [11] uses a two-step framework: in the first step it learns unified hash codes for image-text pairs in a database, and in the second step it utilizes the learned unified hash codes to learn hashing functions. Thus, for CMDVH, the hashing functions learned in the second stage cannot feed back to guide unified hash-code optimization.

Typically, deep cross-modal hashing methods outperform shallow hashing methods in terms of retrieval accuracy. However, most existing deep cross-modal hashing methods cannot bridge the modality gap well enough to generate unified hash codes for image-text pairs in a database. Although CMDVH can generate unified binary codes for points of different modalities, its hashing-function learning procedure cannot feed back to guide the unified hash-code optimization, so CMDVH cannot obtain optimal unified hash codes. Furthermore, note that although DCH can jointly learn unified hash codes for instances in a database and hashing functions for query instances, it is a shallow hashing method: its feature extraction procedure is independent of the hash-code learning procedure, and DCH needs to use all database instances to learn hashing functions, which makes it hard to extend DCH to a deep architecture. We therefore propose a novel deep hashing method that can learn unified hash codes for instances in a database and hashing functions for query instances in an end-to-end framework.
3 Our Method
3.1 Problem Definition
Assume that we have n training instances in a database, and each instance has data points from two modalities. Without loss of generality, we use image-text databases for illustration in this paper, which means that each instance in the database has both a data point of the text modality and a data point of the image modality. We use O = {o_i}_{i=1}^{n} to denote a cross-modal dataset with n instances, where o_i = (x_i, y_i), and x_i and y_i denote the original image and text points of the i-th instance o_i, respectively. L_i ∈ {0, 1}^c is the label annotation assigned to o_i, where c is the number of classes: L_ij = 1 if o_i belongs to the j-th class, and L_ij = 0 otherwise. Furthermore, a pairwise similarity matrix S is used to describe the semantic similarities between instances: S_ij = 1 means that o_i is semantically similar to o_j, and S_ij = -1 otherwise. Specifically, if two instances o_i and o_j are annotated with multiple labels, we define S_ij = 1 when o_i and o_j share at least one label, and S_ij = -1 otherwise.
Given the above database and similarity information S, the goal of DCHUC is to learn similarity-preserving hash codes B ∈ {-1, +1}^{n×k} for the instances in the database, where k is the length of each binary code and b_i denotes the learned hash code for instance o_i, i.e., a unified hash code for the image-text pair x_i and y_i. Meanwhile, the Hamming distance between b_i and b_j should be as small as possible when S_ij = 1; otherwise, it should be as large as possible. Moreover, in order to generate a binary code for any unseen image modal query point x_q or text modal query point y_q, DCHUC should learn two modality-specific hashing functions f(·; θx) and g(·; θy), respectively. To learn the two hashing functions, we sample a subset, or use the whole set, of O as the query set for training, where O_Ω denotes the query instances indexed by Ω from the database O. Moreover, we use Γ = {1, 2, …, n} to denote the indices of all the database instances and Ω ⊆ Γ to denote the indices of the m sampled query instances, and X_Ω and Y_Ω denote the image modal points and text modal points in the query set O_Ω, respectively. Correspondingly, the similarity between query instances and database instances can be denoted as S_Ω ∈ {-1, +1}^{m×n}, which is formed by the rows of S indexed by Ω. In addition, in this paper, sign(·) is an element-wise sign function that returns +1 if the element is positive and -1 otherwise.
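As a concrete illustration, the pairwise similarity matrix S and its sampled sub-matrix S_Ω can be built from a multi-hot label matrix as follows (a minimal NumPy sketch; the {-1, +1} coding of dissimilar pairs is our reading of the lost notation):

```python
import numpy as np

def build_similarity(labels):
    """S_ij = 1 if instances i and j share at least one label, else -1,
    given a multi-hot label matrix `labels` of shape (n, c)."""
    share = (labels @ labels.T) > 0              # boolean: any common label
    return np.where(share, 1, -1).astype(np.int8)

labels = (np.random.rand(1000, 24) < 0.1).astype(np.int8)  # toy annotations
S = build_similarity(labels)                     # (n x n) database similarity
omega = np.random.choice(1000, size=100, replace=False)    # sampled query indices
S_omega = S[omega]                               # rows of S indexed by Omega
```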
3.2 Deep Cross-Modal Hashing with Hashing Functions and Unified Hash Codes Jointly Learning
The model architecture of DCHUC is shown in Fig. 1. It contains three parts: an image modal hashing network, a text modal hashing network, and a hash-code optimization part.
The image modal hashing network is a convolutional neural network (CNN) adapted from AlexNet [33]. The CNN contains eight layers: the first seven layers are the same as those in AlexNet, and the eighth layer is a fully-connected layer, named the hashing layer, whose output is the learned image features. The hashing layer contains k units, where k is the length of the hash codes. A tanh activation function is used to make the output features close to -1 or +1. We use F to denote the final output features of the image modal hashing network.

For the text modal hashing network, a neural network containing two fully-connected layers is used to learn text modal features. We represent each text point as a bag-of-words (BoW) vector and use the BoW vector as the input of the two-layer fully-connected network. The activation function of the first fully-connected layer is ReLU [33]. The second fully-connected layer, also named the hashing layer, has k nodes. Similar to the image feature learning part, a tanh function is used as the activation function to make the output features close to -1 or +1. We use G to denote the final output features of the text modal hashing network.

The hash-code optimization part optimizes the hash codes for both database instances and query instances by minimizing the objective function whose details will be introduced in Section 3.3. More specifically, with a four-step iterative scheme, the unified hash codes for database instances are learned directly, and the modality-specific hashing functions are learned by the back-propagation algorithm, as introduced in detail in Section 3.4. Furthermore, the hash codes for query instances are generated by applying the element-wise sign(·) function to the final output features of the modality-specific hashing networks. Specifically, for an image modal query point x_q, we obtain its binary hash code as b_q^x = sign(f(x_q; θx)); for a text modal query point y_q, its binary hash code is generated by b_q^y = sign(g(y_q; θy)).
3.3 Objective Function
The goal of DCHUC is to map the instances in the database and the unseen query data points into a semantic similarity-preserving Hamming space, where the hash codes of data points from the same categories should be similar no matter which modalities they belong to, and the hash codes of data points from different categories should be dissimilar. In the following, we present the details of the objective function of DCHUC.
In order to bridge the gap across different modalities, we first assume that the image point x_i and text point y_i of any instance o_i in the database share the same hash code b_i, i.e., we learn a unified hash code for the image-text pair x_i and y_i. Thus, the hash code b_i can preserve the image modal information and the text modal information at the same time. Moreover, in order to make the learned hash codes of database instances and the hash codes of query data points generated by the learned hashing functions preserve the semantic similarity, a common way is to minimize the Frobenius-norm loss between the semantic similarities and the inner products of binary code pairs. Therefore, the hashing loss can be defined as follows:
(1)  
where λ is a hyper-parameter; B denotes the unified binary hash codes for the database instances; B_Ω denotes the sub-matrix of B indexed by Ω; U denotes the binary hash codes for the image modal query data points, and V denotes the binary hash codes for the text modal query data points; F is the output of the image modal hashing network for the image query set X_Ω, and G is the output of the text modal hashing network for the text query set Y_Ω.
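For concreteness, a hashing loss consistent with this description, written in the asymmetric inner-product style of ADSH [27] (the symbols follow our notation above; the paper's exact weighting may differ), would be:

\[
\min_{\mathbf{B},\mathbf{U},\mathbf{V}}\;
\big\|\mathbf{U}\mathbf{B}^{\top}-k\,\mathbf{S}_{\Omega}\big\|_F^2
+\big\|\mathbf{V}\mathbf{B}^{\top}-k\,\mathbf{S}_{\Omega}\big\|_F^2
+\lambda\big(\|\mathbf{U}-\mathbf{F}\|_F^2+\|\mathbf{V}-\mathbf{G}\|_F^2\big),
\]

so that the inner products of query and database codes fit the scaled similarities kS_Ω, while the λ term keeps the binary query codes close to the network outputs.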
Furthermore, in order to make the learned hash codes preserve more discriminative semantic information, we expect the learned hash codes to be ideal for classification as well. The classification loss function can be defined as follows:
(2)  
where L is the label matrix of the instances in the database O, and L_Ω denotes the label matrix of the query instances, formed by the rows of L indexed by Ω. W = [w_1, …, w_c], where w_j is the classification projection vector of the j-th class.
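A classification term consistent with these definitions, in the regularized least squares style of SDH [34] (ν denotes the regularization weight in our notation), would be:

\[
\min_{\mathbf{B},\mathbf{W}}\;\big\|\mathbf{L}-\mathbf{B}\mathbf{W}\big\|_F^2+\nu\,\|\mathbf{W}\|_F^2,
\qquad \mathbf{W}=[\mathbf{w}_1,\ldots,\mathbf{w}_c]\in\mathbb{R}^{k\times c},
\]

which is exactly the form that later admits the closed-form update of W in Section 3.4.4.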
Thus, our overall objective function can be defined as follows:
(3)  
However, it is hard to learn the hashing functions directly, because the derivative of the sign(·) function is zero almost everywhere. Moreover, considering that the query set is sampled from the whole database, the hash codes generated by the learned hashing functions should be the same as the directly learned hash codes; i.e., if an instance o_i in the database is sampled as a query instance, then the hash code for the image modality data point x_i and the hash code for the text modality data point y_i in o_i should both equal b_i. Thus, we can further reformulate Formula (3) as:
(4)  
where the trade-off coefficients are hyper-parameters, and B_Ω is formed by the rows of B indexed by Ω.
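Combining the two sketched losses and relaxing the binary query codes to the tanh network outputs F and G, a relaxed objective consistent with this description of Formula (4) (μ, η, ν denote the trade-off hyper-parameters in our notation) would be:

\[
\min_{\mathbf{B},\theta_x,\theta_y,\mathbf{W}}\;
\big\|\mathbf{F}\mathbf{B}^{\top}-k\mathbf{S}_{\Omega}\big\|_F^2
+\big\|\mathbf{G}\mathbf{B}^{\top}-k\mathbf{S}_{\Omega}\big\|_F^2
+\mu\big\|\mathbf{L}-\mathbf{B}\mathbf{W}\big\|_F^2
+\eta\big(\|\mathbf{B}_{\Omega}-\mathbf{F}\|_F^2+\|\mathbf{B}_{\Omega}-\mathbf{G}\|_F^2\big)
+\nu\|\mathbf{W}\|_F^2,
\]

subject to B ∈ {-1, +1}^{n×k}. The η term enforces that, for sampled query instances, the codes produced by the hashing networks agree with the directly learned unified codes B_Ω.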
3.4 Optimization
To optimize Formula (4), we propose a four-step iterative scheme, described below. More specifically, in each iteration we sample a query set from the database and then carry out our learning algorithm based on both the query set and the database. The whole four-step learning algorithm for DCHUC is briefly outlined in Algorithm 1, and the detailed derivation steps are introduced in the remainder of this subsection.
3.4.1 Learn θx with θy, B and W fixed
When θy, B and W are fixed, we update the parameter θx of the image hashing network by using mini-batch stochastic gradient descent with the back-propagation (BP) algorithm. More specifically, for each sampled image point in X_Ω, we first compute the following gradient:
(5)
Then we can compute the gradient with respect to θx by the chain rule, and use BP to update θx.
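Under the sketched form of Formula (4) above, the gradient of the objective J with respect to the image-network output F, in matrix form and before back-propagating through the tanh hashing layer and the CNN to θx, would be:

\[
\frac{\partial J}{\partial \mathbf{F}}
= 2\big(\mathbf{F}\mathbf{B}^{\top}-k\,\mathbf{S}_{\Omega}\big)\mathbf{B}
+ 2\eta\,\big(\mathbf{F}-\mathbf{B}_{\Omega}\big).
\]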
3.4.2 Learn θy with θx, B and W fixed
When θx, B and W are fixed, we similarly update the parameter θy of the text hashing network by using mini-batch stochastic gradient descent with BP. More specifically, for each sampled text point in Y_Ω, we first compute the following gradient:
(6)  
Then we can compute the gradient with respect to θy by the chain rule, and use BP to update θy.
3.4.3 Learn B with θx, θy and W fixed
When θx, θy and W are fixed, we can reformulate Formula (4) as follows:
(7)  
where const is a constant independent of B and tr(·) is the trace operation. For convenience of calculation, we can further reformulate Formula (7) as follows:
(8)  
where the auxiliary matrices are defined as follows:
(9) 
(10) 
Formula (8) is NP-hard. Inspired by SDH [34], the binary codes can be learned by the discrete cyclic coordinate descent (DCC) method; that is, we learn the hash codes bit by bit, updating one column of B while keeping the other columns fixed. Let b_l denote the l-th column of B and B' the matrix of B without its l-th column; the l-th columns (or rows) and the corresponding remaining sub-matrices of the other variables in Formula (8) are defined analogously. Then we can optimize b_l with the following function:
(11)  
Finally, we can obtain the optimal solution of Formula (11):
(12) 
We can then use Formula (12) to update B bit by bit.
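To make the bit-by-bit update concrete, the following NumPy sketch applies the DCC idea to the simplified sub-problem min_B ||U B^T − kS_Ω||_F^2; the paper's full update additionally involves the text-network term and the classification term, and all variable names here are ours:

```python
import numpy as np

def dcc_update(B, U, S, n_sweeps=3):
    """DCC sweeps for min_B ||U @ B.T - n_bits * S||_F^2 with
    B in {-1,+1}^(n x n_bits), U in {-1,+1}^(m x n_bits), S in {-1,+1}^(m x n).
    Each bit column of B has a closed-form optimum with the others fixed."""
    n_bits = B.shape[1]
    Q = n_bits * (S.T @ U)                    # (n x n_bits) constant linear term
    for _ in range(n_sweeps):
        for l in range(n_bits):
            u_l = U[:, l]                     # l-th column of U
            U_hat = np.delete(U, l, axis=1)   # U without its l-th column
            B_hat = np.delete(B, l, axis=1)   # B without its l-th column
            # closed-form optimum of the l-th bit column (cf. SDH's DCC):
            val = Q[:, l] - B_hat @ (U_hat.T @ u_l)
            B[:, l] = np.where(val >= 0, 1, -1)
    return B
```

Each inner step is exact for its column, so the objective is non-increasing over a sweep.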
3.4.4 Learn W with θx, θy and B fixed
When θx, θy and B are fixed, we can reformulate Formula (4) as follows:
(13)  
Formula (13) is a regularized least squares problem with respect to W, which has a closed-form solution:
(14) 
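Assuming the regularized least squares classification term sketched in Section 3.3, the closed-form solution of Formula (14) would be the standard ridge estimate:

\[
\mathbf{W}=\Big(\mathbf{B}^{\top}\mathbf{B}+\tfrac{\nu}{\mu}\,\mathbf{I}\Big)^{-1}\mathbf{B}^{\top}\mathbf{L}.
\]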
Table I: MAP results on MIRFLICKR-25K, IAPR TC-12 and NUS-WIDE. I→T denotes retrieving texts with image queries; T→I denotes retrieving images with text queries.
Task  Method  MIRFLICKR-25K  IAPR TC-12  NUS-WIDE
(per dataset)  16 bits  32 bits  48 bits  64 bits
I→T  CCA-ITQ  0.599  0.587  0.582  0.578  0.403  0.399  0.396  0.390  0.426  0.415  0.410  0.401
I→T  SCM  0.639  0.612  0.584  0.592  0.438  0.423  0.414  0.398  0.403  0.371  0.349  0.328
I→T  DCH  0.759  0.780  0.793  0.794  0.536  0.559  0.564  0.582  0.619  0.652  0.653  0.681
I→T  DLFH  0.769  0.796  0.805  0.809  0.470  0.498  0.516  0.555  0.599  0.608  0.619  0.630
I→T  DCMH  0.763  0.771  0.771  0.779  0.511  0.525  0.527  0.535  0.629  0.642  0.652  0.662
I→T  CMDVH  0.612  0.610  0.553  0.600  0.381  0.383  0.396  0.381  0.371  0.359  0.399  0.424
I→T  SSAH  0.783  0.793  0.800  0.783  0.538  0.566  0.580  0.586  0.613  0.632  0.635  0.633
I→T  DCHUC  0.850  0.857  0.853  0.854  0.615  0.666  0.681  0.693  0.698  0.728  0.742  0.749
T→I  CCA-ITQ  0.593  0.582  0.577  0.574  0.312  0.311  0.310  0.309  0.424  0.412  0.398  0.387
T→I  SCM  0.626  0.595  0.588  0.578  0.313  0.310  0.309  0.308  0.395  0.368  0.353  0.335
T→I  DCH  0.748  0.786  0.799  0.805  0.486  0.486  0.496  0.502  0.648  0.678  0.699  0.708
T→I  DLFH  0.719  0.732  0.742  0.748  0.417  0.451  0.484  0.490  0.558  0.578  0.591  0.593
T→I  DCMH  0.721  0.733  0.729  0.742  0.464  0.485  0.490  0.498  0.588  0.607  0.615  0.632
T→I  CMDVH  0.611  0.626  0.553  0.598  0.376  0.373  0.365  0.376  0.370  0.373  0.414  0.425
T→I  SSAH  0.779  0.789  0.796  0.794  0.539  0.564  0.581  0.587  0.659  0.666  0.679  0.667
T→I  DCHUC  0.878  0.882  0.880  0.881  0.630  0.677  0.695  0.701  0.750  0.771  0.783  0.791
Table II: Precision@1000 results on MIRFLICKR-25K, IAPR TC-12 and NUS-WIDE. I→T denotes retrieving texts with image queries; T→I denotes retrieving images with text queries.
Task  Method  MIRFLICKR-25K  IAPR TC-12  NUS-WIDE
(per dataset)  16 bits  32 bits  48 bits  64 bits
I→T  CCA-ITQ  0.690  0.676  0.666  0.652  0.491  0.492  0.488  0.482  0.622  0.672  0.684  0.683
I→T  SCM  0.749  0.714  0.675  0.639  0.504  0.506  0.523  0.497  0.598  0.576  0.532  0.668
I→T  DCH  0.848  0.848  0.843  0.852  0.664  0.695  0.701  0.712  0.808  0.819  0.808  0.815
I→T  DLFH  0.834  0.857  0.865  0.870  0.563  0.604  0.638  0.660  0.685  0.707  0.717  0.735
I→T  DCMH  0.815  0.824  0.834  0.835  0.596  0.610  0.613  0.626  0.694  0.710  0.721  0.731
I→T  CMDVH  0.613  0.636  0.545  0.601  0.396  0.410  0.403  0.396  0.340  0.293  0.408  0.417
I→T  SSAH  0.824  0.834  0.846  0.855  0.641  0.664  0.674  0.677  0.701  0.729  0.736  0.731
I→T  DCHUC  0.896  0.897  0.890  0.888  0.711  0.760  0.771  0.782  0.799  0.825  0.839  0.849
T→I  CCA-ITQ  0.666  0.656  0.649  0.635  0.401  0.341  0.302  0.302  0.607  0.657  0.667  0.666
T→I  SCM  0.738  0.704  0.676  0.660  0.376  0.349  0.324  0.315  0.606  0.565  0.550  0.504
T→I  DCH  0.844  0.866  0.860  0.868  0.593  0.604  0.612  0.617  0.813  0.829  0.822  0.817
T→I  DLFH  0.800  0.817  0.824  0.825  0.480  0.536  0.584  0.596  0.646  0.682  0.703  0.698
T→I  DCMH  0.764  0.795  0.817  0.822  0.546  0.572  0.580  0.595  0.667  0.686  0.704  0.709
T→I  CMDVH  0.693  0.761  0.695  0.733  0.371  0.380  0.331  0.371  0.493  0.527  0.598  0.589
T→I  SSAH  0.840  0.854  0.859  0.863  0.648  0.663  0.681  0.678  0.738  0.749  0.765  0.749
T→I  DCHUC  0.917  0.918  0.912  0.911  0.724  0.766  0.781  0.783  0.845  0.859  0.872  0.881
3.5 Out-of-Sample Extension
For any instance o_q that is not in the retrieval set, we can obtain the hash codes of its two modalities. In particular, given the image modality x_q of an instance o_q, we can adopt forward propagation to generate the hash code as follows:
(15)  b_q^x = sign(f(x_q; θx))
Similarly, we can use the text hashing network to generate the hash code of an instance with only the textual modality y_q:
(16)  b_q^y = sign(g(y_q; θy))
4 Experiments
To evaluate the performance of DCHUC, we carry out extensive experiments on three image-text datasets and compare it with seven state-of-the-art cross-modal hashing methods.
4.1 Datasets
Three datasets are used for evaluation: MIRFLICKR-25K [35], IAPR TC-12 [36] and NUS-WIDE [37], described below.
The MIRFLICKR-25K dataset [35] contains 25,000 instances collected from the Flickr website, where each image is labeled with several textual tags. We follow the experimental protocol given in DCMH [32]: in total, 20,015 instances that have at least 20 textual tags are selected for our experiments. The text modality of each instance is represented as a 1,386-dimensional bag-of-words (BoW) vector, and each instance is manually annotated with at least one of 24 unique labels. For this dataset, we randomly sampled 2,000 instances as the test set and used the remaining instances as the database (retrieval set). Furthermore, since the training phase of existing deep cross-modal hashing methods is typically time-consuming, they cannot work efficiently on large-scale datasets; therefore, for the deep methods, we randomly sample 10,000 instances from the retrieval set as the training set.
The IAPR TC-12 dataset [36] consists of 20,000 instances annotated with 255 labels. After pruning the instances without any text information, a subset of 19,999 image-text pairs is selected for our experiments. The text modality of each instance is represented as a 2,000-dimensional BoW vector. For this dataset, we randomly sampled 2,000 instances as the test set and used the rest of the instances as the retrieval set. We randomly select 10,000 instances from the retrieval set for training the deep cross-modal baselines.
The NUS-WIDE dataset [37] contains 269,648 instances crawled from Flickr. Each image is associated with textual tags, and each instance is annotated with one or more of 81 concept labels. Only the 195,834 image-text pairs that belong to the 21 most frequent concepts are selected for our experiments. The text modality of each instance is represented as a 1,000-dimensional BoW vector. For this dataset, we randomly sampled 2,100 instances as the test set and used the rest of the instances as the retrieval set. Because the deep hashing baselines are very time-consuming to train, we randomly select 10,500 instances from the database for training the deep cross-modal baselines.
For all the shallow cross-modal baselines, the whole database is used for training. For all datasets, an image x_i and a text y_j are defined as a similar pair if o_i and o_j share at least one common label; otherwise, they are defined as a dissimilar pair.
4.2 Baselines and Implementation Details
We compare our DCHUC with seven state-of-the-art methods, including four shallow cross-modal hashing methods, i.e., DLFH [19], SCM [28], CCA-ITQ [38] and DCH [31], and three deep cross-modal hashing methods, i.e., DCMH [32], CMDVH [11] and SSAH [22]. The source codes of all baselines except CMDVH and DCH are kindly provided by their authors, and we carefully tuned their parameters according to the schemes suggested by the authors; CMDVH and DCH we carefully implemented ourselves. For a fair comparison, we utilize AlexNet [33], pretrained on the ImageNet dataset [39], to extract deep features as the image inputs of all shallow cross-modal baselines, while the image modality hashing network of each deep cross-modal baseline takes raw pixels as input.

For the proposed method, we initialize the first seven layers of the image modal hashing network with the AlexNet [33] model pretrained on ImageNet [39]. All the parameters of the text modal hashing network and of the hashing layer of the image hashing network are initialized by Xavier initialization [40]. The unified binary code matrix B is initialized randomly and zero-centered. The image inputs are raw pixels, and the text inputs are the BoW vectors. The five hyper-parameters of DCHUC are empirically set to 50, 200, 1, 50 and 50, respectively, and they are discussed in Section 4.7; the remaining parameters are set by a validation strategy for all datasets. We adopt SGD with a mini-batch size of 64 as our optimization algorithm, with the learning rate initialized separately for the image hashing network and the text modal hashing network. To avoid the effect of the class-imbalance problem between positive and negative similarity information, we empirically weight the negative elements of S by the ratio between the number of positive elements and the number of negative elements in S.
The source codes of CMDVH, DCH and our proposed method will be available at: https://github.com/AcademicHammer
4.3 Evaluation Protocol
For the hashing-based cross-modal retrieval task, Hamming ranking and hash lookup are two widely used retrieval protocols for evaluating hashing methods. In our experiments, we use three evaluation criteria: mean average precision (MAP), precision at n (P@n), and the precision-recall (PR) curve. MAP is the most widely used metric for the Hamming ranking protocol and is defined as the mean of the average precision over all queries. The PR curve is used to evaluate the accuracy of the hash lookup protocol, and P@n evaluates precision by considering only the top n returned points.
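For reference, MAP under the Hamming ranking protocol can be computed as in the following NumPy sketch (codes in {-1, +1}; an item is relevant to a query if they share at least one label):

```python
import numpy as np

def mean_average_precision(q_codes, db_codes, q_labels, db_labels):
    """MAP: rank the database by Hamming distance to each query, then
    average the precision measured at each relevant position."""
    n_bits = q_codes.shape[1]
    aps = []
    for q, ql in zip(q_codes, q_labels):
        dist = (n_bits - db_codes @ q) / 2                # Hamming distance
        order = np.argsort(dist)
        rel = (db_labels[order] @ ql > 0).astype(float)   # shared label -> relevant
        if rel.sum() == 0:
            continue
        prec = np.cumsum(rel) / np.arange(1, len(rel) + 1)
        aps.append(float((prec * rel).sum() / rel.sum()))
    return float(np.mean(aps))
```

P@n follows from the same ranking by averaging `rel[:n].mean()` over all queries.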
4.4 Experimental Results
All experiments are run 3 times to reduce randomness, and the average accuracy is reported.
4.4.1 Hamming Ranking Task
Table I and Table II present the MAP and Precision@1000 results on the MIRFLICKR-25K, IAPR TC-12 and NUS-WIDE datasets, respectively. I→T denotes retrieving texts with image queries, and T→I denotes retrieving images with text queries. In general, from Table I and Table II, we have three observations. (1) Our proposed method outperforms the other cross-modal hashing methods for different hash code lengths. For example, on MIRFLICKR-25K, compared with the best competitor SSAH at 16 bits, the results of DCHUC for T→I show a relative increase of 12.7% in MAP and 9.2% in Precision@1000, and the results for I→T show a relative increase of 8.6% in MAP and 8.7% in Precision@1000. On IAPR TC-12, compared with SSAH at 64 bits, the results of DCHUC for T→I show a relative increase of 19.4% in MAP and 15.5% in Precision@1000, and the results for I→T show a relative increase of 18.3% in MAP and 15.5% in Precision@1000. On NUS-WIDE, compared with the best competitor DCH at 64 bits, the results of DCHUC for T→I show a relative increase of 12.3% in MAP and 7.8% in Precision@1000. (2) Integrating the feature learning of data points and the hashing-function learning into an end-to-end network yields better performance. For example, our proposed method outperforms DCH, which can also jointly learn unified hash codes for database instances and modality-specific hashing functions for unseen data points, but whose feature extraction procedure is independent of the hash-code learning procedure. (3) Jointly learning unified hash codes for database instances and modality-specific hashing functions for unseen data points greatly increases retrieval performance. For instance, DCHUC achieves better MAP and Precision@1000 than CMDVH on all three benchmark datasets. Note that the results of CMDVH are not as good as those reported in the original article; this may be because we used more label classes in our experiments, which makes the SVM used in CMDVH hard to train. Furthermore, although DCH is a shallow hashing method, its retrieval performance on the MIRFLICKR-25K and IAPR TC-12 datasets is similar to that of the best deep baseline SSAH, and its retrieval performance on NUS-WIDE is better than that of SSAH.
4.4.2 Hash Lookup Task
When considering the lookup protocol, we compute the precision-recall (PR) curve for the returned points at each Hamming radius. The PR curve is obtained by varying the Hamming radius from 0 to k with a step size of 1. Fig. 2, Fig. 3 and Fig. 4 show the PR curves on the MIRFLICKR-25K, IAPR TC-12 and NUS-WIDE datasets, respectively. It is easy to see that DCHUC dramatically outperforms the state-of-the-art baselines, which means that DCHUC places the hash codes of similar points within a small Hamming radius. For example, compared with the baselines, the precision of DCHUC decreases more slowly as recall increases, and DCHUC retains high precision on the MIRFLICKR-25K and NUS-WIDE datasets even when recall reaches 0.9.

4.5 Convergence Analysis
To verify the convergence property of DCHUC, we conduct an experiment on the NUS-WIDE dataset with a code length of 64. Fig. 5 shows the convergence of the objective function value and the MAP. As shown in Fig. 5 (a), the objective function value converges after only about 10 iterations. In Fig. 5 (b), I→T denotes retrieving texts with image queries, and T→I denotes retrieving images with text queries; the MAP values of both retrieval tasks converge. Furthermore, combining the two subfigures Fig. 5 (a) and (b), we can see that both MAP values increase as the objective function value decreases, and eventually converge.
4.6 Training Efficiency
To evaluate the training speed of DCHUC, we conduct experiments against the deep cross-modal baselines (except CMDVH) on the three datasets. Fig. 6 shows MAP as a function of training time on the three datasets for DCHUC, SSAH and DCMH. DCHUC not only trains faster than the two deep cross-modal baselines but also achieves better retrieval performance. CMDVH is a two-step method, so a MAP-time curve comparison would be unfair; instead, we measure its total training time. The training times at 32 bits on the IAPR TC-12, MIRFLICKR-25K and NUS-WIDE datasets are 16.3s, 21.2s and 39.2s for CMDVH, and 11.8s, 12.9s and 28.2s for DCHUC, respectively. Thus DCHUC is also the faster of the two.
4.7 Sensitivity to Parameters
We study the influence of the five hyper-parameters on the IAPR TC-12, MIRFLICKR-25K and NUS-WIDE datasets with the code length fixed at 64 bits. The fifteen panels of Fig. 7, (a) through (o), show the effect of each hyper-parameter on each of the three datasets over a wide range of values. It can be seen that DCHUC is not sensitive to these hyper-parameters: it achieves good performance on all three datasets across several orders of magnitude for each of them, and it obtains high MAP values over the whole tested range.
5 Conclusion
In this paper, we have proposed a novel deep hashing method for cross-modal data, called DCHUC. To the best of our knowledge, DCHUC is the first deep method to jointly learn unified hash codes for database instances and hashing functions for unseen query points in an end-to-end framework. Extensive experiments on three real-world public datasets have shown that the proposed DCHUC method outperforms state-of-the-art cross-modal hashing methods.
Acknowledgment
The work is supported by SFSMBRP (2018YFB1005100), BIGKE (No. 20160754021), NSFC (No. 61772076 and 61751201), NSFB (No. Z181100008918002), Major Project of Zhijiang Lab (No. 2019DH0ZX01), CETC (No. w2018018) and OPBKLICDD (No. ICDD201901).
References
[1] Z. Cao, M. Long, J. Wang, and S. Y. Philip, “Hashnet: Deep learning to hash by continuation,” in ICCV, 2017, pp. 5609–5618.
[2] X. Liu, X. Nie, W. Zeng, C. Cui, L. Zhu, and Y. Yin, “Fast discrete cross-modal hashing with regressing from semantic labels,” in 2018 ACM Multimedia Conference on Multimedia Conference. ACM, 2018, pp. 1662–1669.
[3] T. Zhang and J. Wang, “Collaborative quantization for cross-modal similarity search,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2036–2045.
[4] M. M. Bronstein, A. M. Bronstein, F. Michel, and N. Paragios, “Data fusion through cross-modality metric learning using similarity-sensitive hashing,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010, pp. 3594–3601.
[5] Z. Jin, C. Li, Y. Lin, and D. Cai, “Density sensitive hashing,” IEEE Transactions on Cybernetics, vol. 44, no. 8, pp. 1362–1371, 2014.
[6] S. Huang, Y. Xiong, Y. Zhang, and J. Wang, “Unsupervised triplet hashing for fast image retrieval,” in Proceedings of the Thematic Workshops of ACM Multimedia 2017. ACM, 2017, pp. 84–92.
[7] K. Ghasedi Dizaji, F. Zheng, N. Sadoughi, Y. Yang, C. Deng, and H. Huang, “Unsupervised deep generative adversarial hashing network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3664–3673.
[8] D. Wang, H. Huang, C. Lu, B.-S. Feng, L. Nie, G. Wen, and X.-L. Mao, “Supervised deep hashing for hierarchical labeled data,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 7388–7395.
 [9] Z. Qiu, Y. Pan, T. Yao, and T. Mei, “Deep semantic hashing with generative adversarial networks,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2017, pp. 225–234.
[10] H. Liu, R. Ji, Y. Wu, F. Huang, and B. Zhang, “Cross-modality binary code learning via fusion similarity hashing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7380–7388.
[11] V. Erin Liong, J. Lu, Y.-P. Tan, and J. Zhou, “Cross-modal deep variational hashing,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4077–4085.
[12] F. Feng, X. Wang, and R. Li, “Cross-modal retrieval with correspondence autoencoder,” in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 7–16.
[13] C. Deng, X. Tang, J. Yan, W. Liu, and X. Gao, “Discriminative dictionary learning with common label alignment for cross-modal retrieval,” IEEE Transactions on Multimedia, vol. 18, no. 2, pp. 208–218, 2016.
[14] B. Wang, Y. Yang, X. Xu, A. Hanjalic, and H. T. Shen, “Adversarial cross-modal retrieval,” in Proceedings of the 25th ACM International Conference on Multimedia. ACM, 2017, pp. 154–162.
[15] E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao, “Pairwise relationship guided deep hashing for cross-modal retrieval,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[16] W. Liu, C. Mu, S. Kumar, and S.-F. Chang, “Discrete graph hashing,” in Advances in Neural Information Processing Systems, 2014, pp. 3419–3427.
[17] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 2074–2081.
[18] X. Luo, X.-Y. Yin, L. Nie, X. Song, Y. Wang, and X.-S. Xu, “SDMCH: Supervised discrete manifold-embedded cross-modal hashing,” in IJCAI, 2018, pp. 2518–2524.
[19] Q.-Y. Jiang and W.-J. Li, “Discrete latent factor model for cross-modal hashing,” IEEE Transactions on Image Processing, 2019.
[20] Y. Cao, M. Long, J. Wang, and H. Zhu, “Correlation autoencoder hashing for supervised cross-modal search,” in Proceedings of the 2016 ACM International Conference on Multimedia Retrieval. ACM, 2016, pp. 197–204.
[21] Y. Cao, M. Long, J. Wang, Q. Yang, and P. S. Yu, “Deep visual-semantic hashing for cross-modal retrieval,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1445–1454.
[22] C. Li, C. Deng, N. Li, W. Liu, X. Gao, and D. Tao, “Self-supervised adversarial hashing networks for cross-modal retrieval,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4242–4251.
[23] Y. Shen, L. Liu, L. Shao, and J. Song, “Deep binaries: Encoding semantic-rich cues for efficient textual-visual cross retrieval,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4097–4106.
[24] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to cross-modal multimedia retrieval,” in Proceedings of the 18th ACM International Conference on Multimedia. ACM, 2010, pp. 251–260.
[25] J. Zhou, G. Ding, and Y. Guo, “Latent semantic sparse hashing for cross-modal similarity search,” in Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 2014, pp. 415–424.
 [26] G. Ding, Y. Guo, and J. Zhou, “Collective matrix factorization hashing for multimodal data,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 2075–2082.
[27] Q.-Y. Jiang and W.-J. Li, “Asymmetric deep supervised hashing,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[28] D. Zhang and W.-J. Li, “Large-scale supervised multimodal hashing with semantic correlation maximization,” in Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
[29] S. Kumar and R. Udupa, “Learning hash functions for cross-view similarity search,” in Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
[30] Z. Lin, G. Ding, M. Hu, and J. Wang, “Semantics-preserving hashing for cross-view retrieval,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3864–3872.
[31] X. Xu, F. Shen, Y. Yang, H. T. Shen, and X. Li, “Learning discriminative binary codes for large-scale cross-modal retrieval,” IEEE Transactions on Image Processing, vol. 26, no. 5, pp. 2494–2507, 2017.
[32] Q.-Y. Jiang and W.-J. Li, “Deep cross-modal hashing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3232–3240.
 [33] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [34] F. Shen, C. Shen, W. Liu, and H. Tao Shen, “Supervised discrete hashing,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 37–45.
[35] M. J. Huiskes and M. S. Lew, “The MIR Flickr retrieval evaluation,” in Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval. ACM, 2008, pp. 39–43.
 [36] H. J. Escalante, C. A. Hernández, J. A. Gonzalez, A. LópezLópez, M. Montes, E. F. Morales, L. E. Sucar, L. Villaseñor, and M. Grubinger, “The segmented and annotated iapr tc12 benchmark,” Computer Vision and Image Understanding, vol. 114, no. 4, pp. 419–428, 2010.
[37] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: A real-world web image database from National University of Singapore,” in Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, 2009, p. 48.
[38] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2916–2929, 2013.
 [39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
 [40] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256.