1 Introduction
Cross-modal retrieval Wang et al. (2017a, 2018b); Carvalho et al. (2018); Yu et al. (2018); Wang et al. (2018c); Song and Soleymani (2019); Shang et al. (2019) takes data of one modality as the query to retrieve relevant data in other modalities. Meanwhile, large amounts of heterogeneous multimodal data are generated explosively on various social networks. To tackle the retrieval efficiency problem, cross-modal hashing Ding et al. (2014); Wang et al. (2019b, a, 2015); Xie et al. (2016b); Wang et al. (2018a); Xie et al. (2016c); Tang et al. (2016); Xie et al. (2016a); Liu et al. (2018) has been proposed to project high-dimensional multimodal data into low-dimensional binary hash codes, which are forced to express semantics consistent with the original data. Owing to its high retrieval and storage efficiency, it has attracted considerable attention for large-scale cross-modal search.
Following this trend, hashing methods for cross-modal search have become a research hotspot in the literature. These methods fall into two main categories: unsupervised Kumar and Udupa (2011); Song et al. (2013); Zhu et al. (2013); Ding et al. (2014); Liong et al. (2018); Hu et al. (2018b) and supervised Zhang and Li (2014); Lin et al. (2015); Tang et al. (2016); Wang et al. (2015); Xu et al. (2017); Wang et al. (2018a) cross-modal hashing. Unsupervised cross-modal hashing methods learn low-dimensional embeddings of the original data without any semantic labels; the generated hash codes are learned to keep the semantic correlation of the heterogeneous multimodal data. In contrast, supervised cross-modal hashing methods exhibit strong discrimination capability in the hash code learning process under the supervision of explicit semantic labels.
Shallow cross-modal hashing methods have long been the mainstay of cross-modal retrieval and have achieved promising results. As the problem has been studied more deeply, their biggest defect has become apparent: the hash functions depend on linear or simple nonlinear projections. This may limit the discriminative capability of the modality feature representations and result in low retrieval accuracy of the learned binary codes. Recently, deep cross-modal hashing Zhang et al. (2018); Hu et al. (2018a); Jiang and Li (2017); Zhong et al. (2018) has been proposed to perform deep representation and hash code learning simultaneously. These methods replace the linear mapping with multi-layer nonlinear mappings and thus capture the intrinsic semantic correlations of cross-modal instances more effectively. It has been shown that cross-modal hashing methods based on deep models outperform shallow hash models that directly adopt hand-crafted features.
Although existing methods have achieved great success, they handle the cross-modal retrieval tasks (e.g., image retrieves text and text retrieves image) equally, and simply learn the same couple of hash functions for both. Under such circumstances, the characteristics of the different cross-modal retrieval tasks are ignored, and suboptimal performance may result. To tackle this limitation, this paper proposes a Task-adaptive Asymmetric Deep Cross-modal Hashing (TAADCMH) method that learns task-specific hash functions for each cross-modal sub-retrieval task. The major contributions and innovations are stated as follows:

We propose a new supervised asymmetric hash learning framework based on deep neural networks for large-scale cross-modal search. Two couples of deep hash functions are learned for the different cross-modal retrieval tasks by performing simultaneous deep feature representation and asymmetric hash learning. To the best of our knowledge, no similar work has been proposed yet.

In the asymmetric hash learning part, we jointly optimize the semantic preservation of the original data from multiple modalities and the enhancement of the representation capability of the query modality. With this design, the learned hash codes can establish a semantic connection across different modalities, as well as capture the query semantics of the specific cross-modal retrieval task.

An iterative optimization algorithm is proposed to guarantee the discreteness of the hash codes and alleviate binary quantization errors. Experimental results demonstrate the superiority of the algorithm on two datasets widely used in cross-modal retrieval.
2 Literature review of cross-modal hashing
2.1 Unsupervised Cross-modal Hashing
Unsupervised cross-modal hashing transforms the modality features into shared hash codes by preserving the original similarities. Representative works include Cross-view Hashing (CVH) Kumar and Udupa (2011), Inter-media Hashing (IMH) Song et al. (2013), Linear Cross-modal Hashing (LCMH) Zhu et al. (2013), Collective Matrix Factorization Hashing (CMFH) Ding et al. (2014), Latent Semantic Sparse Hashing (LSSH) Zhou et al. (2014), Robust and Flexible Discrete Hashing (RFDH) Wang et al. (2017b), Cross-modal Discrete Hashing (CMDH) Liong et al. (2018) and Collective Reconstructive Embeddings (CRE) Hu et al. (2018b). CVH is a typical graph-based hashing method extended from standard spectral hashing Weiss et al. (2009). It minimizes weighted Hamming distances to transform the original multi-view data into binary codes. IMH maps heterogeneous multimedia data into hash codes by constructing graphs, and learns the hash functions for new instances by linear regression. Its joint learning scheme effectively preserves the inter- and intra-modality consistency. LCMH first leverages k-means clustering to represent each training item as a k-dimensional vector, and then maps the vector into the to-be-learnt binary codes. CMFH utilizes a collective matrix factorization model to transform multimedia data into a low-dimensional space, and then approximates it with hash codes. It also fuses multi-view information to enhance the search accuracy. LSSH follows a similar idea to CMFH. It learns latent factor matrices for image structures by sparse coding and for text concepts by matrix decomposition. Compared with CMFH, it better captures high-level semantic correlations for similarity search across different modalities. RFDH first learns unified hash codes for the training data by employing discrete collaborative matrix factorization; it then jointly adopts the l2,1-norm and adaptive weighting of each modality to enhance the robustness and flexibility of the hash codes. CMDH presents a discrete optimization strategy to learn unified binary codes for multiple modalities. The strategy projects the heterogeneous data into a low-dimensional latent semantic space by matrix factorization, and the latent features are quantized into hash codes by a projection matrix. CRE learns unified binary codes and binary mappings for the different modalities by collective reconstructive embedding, and simultaneously bridges the semantic gap between the heterogeneous data.
2.2 Supervised Cross-modal Hashing
Supervised cross-modal hashing generates hash codes under the guidance of semantic information. Typical methods include Semantic Correlation Maximization (SCM) Zhang and Li (2014), Semantics-Preserving Hashing (SePH) Lin et al. (2015), Supervised Matrix Factorization Hashing (SMFH) Tang et al. (2016), Semantic Topic Multimodal Hashing (STMH) Wang et al. (2015), Discrete Latent Factor Model based Cross-Modal Hashing (DLFH) Jiang and Li (2019), Discrete Cross-modal Hashing (DCH) Xu et al. (2017) and Label Consistent Matrix Factorization Hashing (LCMFH) Wang et al. (2018a). SCM aims to preserve maximal semantic information in the hash codes while avoiding the explicit computation of the pairwise semantic matrix, which improves both the retrieval speed and the space utilization. SePH first employs a probability distribution to preserve the supervision information of the multimodal data; the hash codes are then obtained by minimizing a Kullback-Leibler divergence. SMFH is developed on the basis of collective matrix decomposition, and jointly employs a graph Laplacian and the semantic labels to learn binary codes for multimodal data. STMH employs semantic modeling to detect different semantic topics for texts and images respectively, and then maps the captured semantic representations into a low-dimensional latent space to obtain the hash codes. DLFH proposes an efficient hash learning algorithm based on a discrete latent factor model to directly learn binary hash codes for cross-modal retrieval. DCH is an extended application of Supervised Discrete Hashing (SDH) Shen et al. (2015) to multimodal retrieval. It learns a set of modality-dependent hash projections as well as discriminative binary codes that keep the classification consistent with the labels of the multimodal data. LCMFH leverages an auxiliary matrix to project the original multimodal data to a low-dimensional latent-space representation, and quantizes it, together with the semantic labels, into the hash codes. All the above hashing methods are shallow models, which impose linear or simple nonlinear transformations to construct the hash functions. Thus, these methods cannot fully explore the semantic correlations of heterogeneous multimodal data.
2.3 Deep Cross-modal Hashing
Deep cross-modal hashing methods basically seek a common binary semantic space via multi-layer nonlinear projections from multiple heterogeneous modalities. State-of-the-art deep cross-modal hashing methods include Unsupervised Generative Adversarial Cross-modal Hashing (UGACH) Zhang et al. (2018), Deep Binary Reconstruction for Cross-modal Hashing (DBRC) Hu et al. (2018a), Deep Cross-Modal Hashing (DCMH) Jiang and Li (2017), Discrete Deep Cross-Modal Hashing (DDCMH) Zhong et al. (2018) and Self-supervised Adversarial Hashing (SSAH) Li et al. (2018). UGACH promotes the learning of hash functions by the confrontation between a generative model and a discriminative model, and incorporates a correlation graph into the learning procedure to capture the intrinsic manifold structures of the multimodal data. DBRC develops a deep network based on a special Multimodal Restricted Boltzmann Machine (MRBM) to learn binary codes. The network employs an adaptive tanh hash function to obtain binary-valued representations instead of a joint real-valued representation, and reconstructs the original data to preserve the maximum semantic similarity across the different modalities. DCMH first extracts the deep features of the text and image modalities through two neural networks, and then preserves the similarity of the two different deep features in unified hash codes by using a pairwise similarity matrix. DDCMH proposes cross-modal deep neural networks that directly encode the binary hash codes by employing discrete optimization, which can effectively preserve the intra- and inter-modality semantic correlations. SSAH devises a deep self-supervised adversarial network to solve the cross-modal hashing problem. This network combines multi-label semantic information and adversarial learning to eliminate the semantic gap between the deep features extracted from the heterogeneous modalities.
Differences: The existing deep-learning-based cross-modal hashing approaches handle the different cross-modal retrieval tasks equally when constructing the hash functions. Under such circumstances, the characteristics of the cross-modal retrieval tasks are ignored during the hash learning process, and thus suboptimal performance may result. Different from them, in this paper we put forward a task-adaptive cross-modal hash learning model that learns two couples of hash functions for the two cross-modal sub-retrieval tasks, respectively. In our model, the semantic similarity across the different modalities is preserved and the representation capability of the query modality is enhanced. With this learning framework, the learned hash codes simultaneously capture the semantic correlation of the different modalities and the query semantics of the specific cross-modal retrieval task.
Table 1: Main notations. (Symbols lost in extraction are marked with a dash.)

Notation  Description
X  the raw image matrix
—  the text feature matrix
—  deep feature representation matrix of images
—  deep feature representation matrix of texts
P  semantic projection matrix of images
W  semantic projection matrix of texts
—  pairwise semantic matrix
L  pointwise semantic label
B  binary hash codes
—  mini-batch size
—  the dimension of the text features
c  the number of classes
r  hash code length
T  the number of iterations
t  the number of retrieval tasks
3 Task-adaptive asymmetric deep cross-modal hashing
3.1 Notations and problem definition
Assume that a database with training instances is denoted as , where each training instance comprises two modalities: image and text. denotes the raw image matrix. represents the text feature matrix with dimensions. Each image instance is associated with a text instance . Besides, the pointwise semantic label is given as , where is the total number of categories and implies that belongs to class , and otherwise. We define the pairwise semantic matrix , each element of which is denoted as . When , the image is similar to the text ; otherwise, when , the image is dissimilar to the text . In general, the cross-modal retrieval problem (with two modalities, image and text) comprises two sub-retrieval tasks: image searches text (I2T) and text searches image (T2I). The goal of our method is to learn two kinds of nonlinear hash functions and for the different cross-modal retrieval tasks, where is the length of the hash codes, the binary hash codes relate to the image hash function for the I2T task, and the binary hash codes relate to the text hash function for the T2I task. Table 1 lists the main notations used in this paper.
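Concretely, under the usual multi-label convention (two instances count as similar when their label vectors share at least one category), the pairwise semantic matrix can be built from the label matrix as in the sketch below. The paper states this definition only in words, so the exact construction is an assumption:

```python
import numpy as np

def pairwise_semantic_matrix(labels):
    """Build the pairwise semantic matrix S from a multi-label
    matrix `labels` of shape (n, c) with entries in {0, 1}.

    S[i, j] = 1 if instances i and j share at least one category,
    otherwise 0 -- the usual multi-label similarity convention.
    """
    overlap = labels @ labels.T            # counts of shared categories
    return (overlap > 0).astype(np.int8)

# Toy example: 3 instances, 3 categories.
L = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]])
S = pairwise_semantic_matrix(L)
```

Instances 0 and 1 share no category, so `S[0, 1] = 0`, while instances 0 and 2 share category 0, so `S[0, 2] = 1`.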
3.2 Model formulation
In this paper, we propose a supervised asymmetric deep cross-modal hashing model, which includes two parts: deep feature learning and asymmetric hash learning. In the first part, we extract deep image and text feature representations with two couples of deep neural networks. In the second part, we perform asymmetric hash learning to capture the semantic correlations of the multimedia data under the supervision of the pairwise semantic matrix, and enhance the discriminative ability of the query modality representation with the pointwise semantic labels. The overall learning framework of our TAADCMH method is illustrated in Figure 1.
3.2.1 Deep feature learning
In the deep feature learning part, we design two couples of deep neural networks for the two cross-modal sub-retrieval tasks. As shown in Figure 1, each pair of image-text deep networks performs the I2T and T2I sub-retrieval tasks, respectively. For fairness, we use similar deep neural networks for the image modality in both sub-retrieval tasks. Both image networks are based on the convolutional neural network CNN-F, with weights initialized by pre-training on the ImageNet dataset Deng et al. (2009). Specifically, CNN-F is an eight-layer deep network with five convolutional layers and three fully-connected layers. We modify the last fully-connected layer by setting the number of hidden units to the hash code length, and adopt the identity function as the activation of the last layer. We also use two deep neural networks for the text modality, one per sub-retrieval task, each of which consists of two fully-connected layers. In particular, we represent the original text vectors as Bag-of-Words (BOW) vectors Yang et al. (2015), which are then used as the input to the text network. The hash codes are obtained as the outputs of the last fully-connected layer. Similar to the image network, we adopt the identity function as the activation. In this paper, the deep hash functions are denoted as for the image and text modalities separately, where is the weight parameters of the deep image networks and is the weight parameters of the deep text networks.
3.2.2 Asymmetric hash learning for I2T
The cross-modal retrieval problem comprises two sub-retrieval tasks: image retrieves text and text retrieves image. Previous methods generally learn the same couple of hash functions in a symmetric way for the two different retrieval tasks. They cannot effectively capture the query semantics during the nonlinear multimodal mapping process, since they ignore the characteristics of the different cross-modal retrieval tasks. To address this problem, we develop an asymmetric hash learning model that learns different hash functions for the different retrieval tasks. Specifically, for each task, besides optimizing the semantic preservation of the multimodal data in the hash codes, we perform semantic regression from the query-specific modality representation to the explicit labels. With this design, the semantic correlations of the multimodal data are preserved in the hash codes and, simultaneously, the query semantics are captured adaptively.
The overall objective function of the I2T sub-retrieval task is formulated as
(1) 
where , , , are regularization parameters, and , with , and , with , are the deep features extracted from the images and texts, respectively. is the binary hash code matrix to be learned for the I2T task; it takes binary values owing to the imposed discrete constraint. is the pointwise semantic label. is the semantic projection matrix that supports the semantic regression from the image (query) modality representation to the label L. The first term in Eq.(1) is the negative log-likelihood function, which is based on the likelihood function defined as
(2) 
where . The negative log-likelihood function makes and as similar as possible when , and dissimilar when . Thus, this term preserves the semantic correlation between the deep image features and the deep text features under the pairwise semantic supervision. The second and third terms in Eq.(1) transform the deep features and into the binary hash codes , which collectively preserve the cross-modal semantics in the binary hash codes. The last term avoids overfitting. It is defined as below:
(3) 
The term equally partitions the information over each bit and ensures that the maximum semantic similarity is preserved in the hash codes.
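Eq.(2) is elided above; in DCMH-style formulations the likelihood is sigma(theta_ij) with theta_ij = (1/2) F_i^T G_j, which yields the negative log-likelihood sketched below. The exact form is an assumption stated here for illustration, not the paper's verbatim loss:

```python
import numpy as np

def pairwise_nll(F, G, S):
    """Negative log-likelihood of the pairwise similarity matrix S.

    F: (n, r) deep image features; G: (n, r) deep text features.
    Assumes the DCMH-style likelihood sigma(theta_ij) with
    theta_ij = 0.5 * F[i] @ G[j]; the paper elides Eq.(2), so this
    is a hedged reconstruction of the standard form.
    """
    theta = 0.5 * F @ G.T
    # -sum_ij [ S_ij * theta_ij - log(1 + exp(theta_ij)) ],
    # with logaddexp used for numerical stability.
    return -np.sum(S * theta - np.logaddexp(0.0, theta))
```

Minimizing this loss pushes theta_ij up (features aligned) for similar pairs and down for dissimilar pairs, which is exactly the behavior described for the first term of Eq.(1).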
3.2.3 Asymmetric hash learning for T2I
Different from the I2T sub-retrieval task, we directly regress the deep text representation to the corresponding pointwise semantic label to preserve the discriminative information of the query modality representation. Specifically, we adopt the pairwise semantic labels to learn new binary hash codes that preserve the semantic correlation of the multimodal data and capture the query semantics from the texts.
Similar to Eq.(1), the objective function of the T2I sub-retrieval task is formulated as:
(4) 
where , with , and , with , are the deep features extracted from the images and texts, respectively. is the semantic projection matrix that supports the semantic regression from the text (query) modality representation to the label L. The balance parameters , , and are the regularization parameters of the T2I task. The regularization function is denoted as follows:
(5) 
This term is the same as that for the I2T task and is used to balance each bit of the hash codes.
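The balance term is described above only in words. A common realization, assumed here (as in DCMH-style models, not necessarily the paper's exact form), penalizes the squared per-bit sums over the training set, driving each bit toward an even split of +1 and -1:

```python
import numpy as np

def bit_balance_penalty(F):
    """Balance regularizer: squared column sums of the code matrix.

    F: (n, r) real-valued codes (one row per training point). Driving
    each bit's sum over the training set toward zero makes +1 and -1
    roughly equiprobable, so each bit carries maximal information.
    This realization (|| F^T 1 ||^2) is an assumption.
    """
    col_sums = F.sum(axis=0)        # one sum per bit
    return float(np.sum(col_sums ** 2))
```

A perfectly balanced code matrix (each bit +1 on half the points, -1 on the other half) incurs zero penalty; a constant bit incurs the maximum.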
3.3 Optimization scheme
The objective functions of the I2T and T2I retrieval tasks are both non-convex with respect to the involved variables. In this paper, we propose an iterative optimization method to learn the optimal values for I2T and T2I.
1. For the I2T sub-retrieval task, the iterative optimization steps are as follows:
Step 1. Update . The problem in Eq.(1) can be rewritten as
The deep CNN parameters of the image modality can be trained by stochastic gradient descent (SGD) Bottou (2010) with the back-propagation (BP) algorithm. In each iteration, we randomly select a mini-batch of samples from the database to train the network, which prevents the SGD algorithm from falling directly into a local optimum near the initial point. Specifically, we first compute the following gradient for each instance of :
(6) 
Then we update according to the BP rule until convergence.
Step 2. Update . The optimization problem in Eq.(1) becomes
The deep CNN parameters of the text modality are also trained by the SGD and BP algorithms. First, we compute the following gradient for each instance of :
(7) 
Then we update according to the BP rule until convergence.
Step 3. Update . The problem in Eq.(1) can be formulated as
(8) 
The solution of Eq.(8) can be obtained by direct optimization without relaxing the discrete binary constraints . Thus, we have
(9) 
Step 4. Update P. The corresponding optimization problem can be simplified as
(10) 
Setting the derivative of Eq.(10) with respect to P to zero, we obtain
(11) 
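Eqs.(8)-(11) are elided in the extraction above. As a hedged illustration of the generic shape such alternating updates take — assuming quadratic code-fitting terms of the form mu*||B - F||^2 + eta*||B - G||^2 and a ridge-regularized regression of the deep image features onto the labels (both assumptions, not the paper's exact formulation) — the two closed-form steps can be sketched as:

```python
import numpy as np

def update_B(F, G, mu=1.0, eta=1.0):
    """Discrete code update: B = sign(mu*F + eta*G).

    Keeps B exactly binary at every iteration, i.e. the discrete
    constraint is never relaxed. Assumed quadratic loss terms.
    """
    B = np.sign(mu * F + eta * G)
    B[B == 0] = 1                    # sign(0) -> +1 by convention
    return B

def update_P(F, labels, gamma=1e-2):
    """Ridge-style closed form for a projection P of shape (r, c):
    minimize ||labels - F @ P||^2 + gamma * ||P||^2,
    solved as (F^T F + gamma I)^{-1} F^T labels."""
    r = F.shape[1]
    return np.linalg.solve(F.T @ F + gamma * np.eye(r), F.T @ labels)
```

The sign update is what "without relaxing the discrete binary constraints" refers to, and the linear solve gives the projection in one shot per outer iteration.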
2. For the T2I sub-retrieval task, the iterative optimization steps are given below:
Step 1. Update . The problem in Eq.(4) can be reduced to
(12) 
The deep CNN parameters of the text modality can be learned by the SGD and BP algorithms. First, we compute the following gradient for each instance of :
(13) 
Then we update according to the BP rule until convergence.
Step 2. Update . The optimization problem in Eq.(4) becomes
The deep CNN parameters of the image modality can be trained by the SGD and BP algorithms. First, we compute the following gradient for each instance of :
(14) 
Then we update according to the BP rule until convergence.
Step 3. Update . The problem in Eq.(4) is rewritten as follows
(15) 
Without relaxing the discrete constraints, we obtain the hash codes of Eq.(15) as
(16) 
Step 4. Update W. The optimization problem of Eq.(4) can be reformulated as follows
(17) 
The solution is likewise obtained by setting the derivative of Eq.(17) with respect to W to zero, which yields
(18) 
The final results are obtained by repeating the above steps until convergence. Algorithm 1 summarizes the key optimization steps of the proposed TAADCMH for the I2T task.
3.4 Online query hashing
As discussed earlier, TAADCMH is a deep asymmetric cross-modal hashing method that learns task-adaptive hash functions for the different retrieval tasks. Specifically, given a new query instance of the image modality, we can obtain its hash codes for the I2T retrieval task by using the following formula
Similarly, given a query instance of the text modality , we can obtain the corresponding hash codes for the T2I retrieval task by
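Both query-hashing formulas are elided above; the standard form, consistent with the identity activation described in Section 3.2.1, thresholds the trained network's real-valued output with the sign function. A minimal sketch follows, where the two-layer random-weight text network is purely a hypothetical stand-in for a trained model:

```python
import numpy as np

def query_hash_codes(x, forward):
    """Online hashing for a new query: b = sign(f(x; theta)).

    `forward` is any trained network's forward function returning a
    real-valued r-dimensional output (identity activation on the
    last layer, as in Section 3.2.1).
    """
    out = forward(x)
    b = np.sign(out)
    b[b == 0] = 1                    # break ties toward +1
    return b

# Hypothetical stand-in for a trained two-FC-layer text network
# (1,386-d BOW input, 32-bit codes), with random weights:
rng = np.random.default_rng(0)
W1 = rng.standard_normal((1386, 512))
W2 = rng.standard_normal((512, 32))
text_forward = lambda x: np.maximum(x @ W1, 0) @ W2
codes = query_hash_codes(rng.standard_normal(1386), text_forward)
```

At query time only one cheap forward pass plus a sign thresholding is needed, which is what makes hashing-based retrieval efficient online.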
4 Experimental setting
4.1 Evaluation datasets
We conduct experiments on two public cross-modal retrieval datasets: MIR Flickr Huiskes and Lew (2008) and NUS-WIDE Chua et al. (2009). Both include image and text modalities.
MIR Flickr includes pairs of image-text instances collected from the Flickr website. The dataset provides 24 labels used to classify the instances, each of which belongs to at least one category. We select 20,015 instances labeled with no fewer than 20 textual tags to compose the final dataset. For convenience, the query set of multimodal data is chosen by random selection, and the retrieval set is composed of the remaining multimodal data. Within the retrieval set, a training set of 10,000 instances is further chosen at random. We describe each text as a 1,386-dimensional BOW vector. For fairness, the inputs of the shallow methods are -dimensional CNN features, and the inputs of the deep methods are the original image pixels.
NUS-WIDE includes instances with 81 semantic labels downloaded from the Flickr website. Considering the imbalance of the label distribution, we select the top 21 most common categories and ultimately obtain image-text pairs as our final dataset. In our experiments, we choose 2,000 pairs of instances for the query set, pairs for the retrieval set, and pairs for the training set. The text of each instance is expressed as a -dimensional BOW vector. For the traditional methods, each image is described by a -dimensional deep feature. For the deep methods, the original pixels of each image are used directly as the input.
4.2 Evaluation baselines and metrics
We compare our proposed TAADCMH with several typical cross-modal retrieval methods, including SCM Zhang and Li (2014), SePH Lin et al. (2015), SMFH Tang et al. (2016), STMH Wang et al. (2015), DCH Xu et al. (2017), DLFH Jiang and Li (2019), LCMFH Wang et al. (2018a) and DCMH Jiang and Li (2017). Note that several of these methods have two versions with different optimization algorithms. In our experiments, sequential learning is used for SCM, k-means for SePH, and kernel logistic regression for DLFH. Among the eight compared baselines, DCMH is a deep method and the others are shallow methods. SCM and SMFH are relaxation-based methods, which discard the discrete constraints during hash code quantization; the other methods directly adopt discrete optimization to solve for the hash codes.
To evaluate the retrieval performance, we adopt the mean Average Precision (mAP) Xu et al. (2016) and the top-K precision Lu et al. (2019); Xu (2016) as the evaluation metrics.
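Both metrics have standard definitions; the sketch below computes them by ranking the database by Hamming distance to each query (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP of one query given 0/1 relevance flags in ranked order."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    cum = np.cumsum(rel)
    # precision at each position where a relevant item appears
    precision_at_hit = cum[rel == 1] / (np.flatnonzero(rel) + 1)
    return float(precision_at_hit.mean())

def mean_ap(query_codes, db_codes, relevance):
    """mAP: rank the database by Hamming distance to each query,
    then average the per-query APs. relevance[i, j] = 1 iff database
    item j is relevant to query i (e.g. shares a label)."""
    aps = []
    for q, rel in zip(query_codes, relevance):
        dist = np.count_nonzero(db_codes != q, axis=1)   # Hamming
        order = np.argsort(dist, kind="stable")
        aps.append(average_precision(rel[order]))
    return float(np.mean(aps))

def top_k_precision(ranked_relevance, k):
    """Fraction of relevant items among the top-K retrieved."""
    return float(np.mean(np.asarray(ranked_relevance[:k], dtype=float)))
```

For example, a ranked relevance list [1, 0, 1, 0] gives AP = (1/1 + 2/3) / 2 and top-2 precision 0.5.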


4.3 Implementation details
Our method formulates two objective functions, which involve eight parameters, , , , , , , , , for the two different cross-modal retrieval tasks. In the I2T retrieval task, the regularization parameters and control the regression of the deep features to the asymmetric binary codes for the images and texts, respectively. The parameter ensures that the deep features of the image modality are suitable for supervised classification, and is the regularization parameter that avoids overfitting. The parameters of the T2I retrieval task are analogous. We set different values for the involved parameters to optimize the I2T and T2I retrieval tasks. In our experiments, the best performance of the I2T task is achieved when , and that of the T2I task when , on MIR Flickr. On NUS-WIDE, the best performance is obtained for the I2T task when , and for the T2I task when . In all cases, the number of iterations is set to 500. Moreover, we implement our method on MatConvNet and use the same deep CNNs as DCMH. Before training, we initialize the network weights with weights pre-trained on the ImageNet dataset. During network learning, we use the raw pixels of the images and the BOW vectors of the texts as the inputs to the deep networks, respectively. The learning rate is chosen in the range of . The batch size is set to 128 for both couples of deep networks. All experiments are carried out on a computer with an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz and a 64-bit Ubuntu 16.04.6 LTS operating system.
Table 2: mAP scores (%) on MIR Flickr.

Methods  I2T  T2I  
16 bits  32 bits  64 bits  128 bits  16 bits  32 bits  64 bits  128 bits  
SMFH  53.62  55.36  55.70  56.58  52.07  53.28  54.44  54.52 
STMH  60.08  62.80  64.58  66.59  59.67  60.69  63.53  66.60 
SCM  67.47  68.50  70.91  73.65  68.50  70.93  74.16  75.01 
SePH  70.15  71.33  73.53  74.28  74.31  74.95  75.42  76.75 
DLFH  81.29  82.35  83.37  83.91  84.26  85.62  86.06  87.12 
DCH  80.43  81.92  82.86  83.82  81.04  83.23  83.70  84.93 
LCMFH  74.32  75.29  76.86  77.18  76.47  77.52  78.79  79.21 
DCMH  83.42  84.09  85.55  86.83  86.96  87.71  87.96  88.01 
TAADCMH  88.52  89.47  90.79  91.30  91.36  92.91  93.30  93.87 
5 Experimental results
5.1 Retrieval accuracy comparison
In the experiments, we first report the mAP values of the compared methods on the two datasets. The mAP results of all baselines with hash code lengths ranging from 16 bits to 128 bits are presented in Table 2 and Table 3. On the basis of these results, we reach the following conclusions: 1) For both the I2T and T2I sub-retrieval tasks, TAADCMH consistently outperforms all the compared methods at all code lengths. These results clearly prove the feasibility of our method. The main reasons for the superior performance are: first, TAADCMH trains two couples of deep neural networks to perform the different retrieval tasks independently, which enhances the nonlinear representation of the deep features and captures the query semantics of the two sub-retrieval tasks; second, we jointly adopt the pairwise and pointwise semantic labels to generate binary codes that express the semantic similarity of the different modalities. 2) It is noteworthy that the mAP values of TAADCMH rise as the code length increases. These results demonstrate that, with effective discrete optimization, longer binary codes have stronger discriminative capability in our method. 3) Most methods achieve higher mAP values on the T2I task than on the I2T task. This stems from the fact that text features better reflect the semantic information of the instances. 4) The deep TAADCMH method makes a significant improvement over the shallow methods on both datasets. Note that DCMH, which is also based on a deep model, achieves the second-best performance. This phenomenon is attributed to the deep feature representations extracted by the nonlinear projections and the semantic information used in the binary hash mapping. The results confirm the superior representation capability of deep neural networks.
Table 3: mAP scores (%) on NUS-WIDE.

Methods  I2T  T2I  
16 bits  32 bits  64 bits  128 bits  16 bits  32 bits  64 bits  128 bits  
SMFH  38.12  40.90  41.18  41.34  34.03  36.70  38.53  40.52 
STMH  52.58  53.68  53.76  54.75  47.92  49.56  52.06  53.08 
SCM  56.33  57.83  59.08  60.33  57.97  59.43  59.50  60.81 
SePH  66.51  67.71  70.46  71.44  70.44  72.03  73.78  74.54 
DLFH  67.31  68.52  72.78  73.24  71.70  72.14  74.78  75.92 
DCH  67.23  68.97  72.57  74.85  72.52  74.13  76.84  77.01 
LCMFH  67.07  68.06  69.31  69.43  70.64  71.07  72.99  73.87 
DCMH  69.83  71.86  74.07  75.42  70.42  72.26  75.08  75.64 
TAADCMH  75.52  78.92  79.13  79.39  79.68  79.13  80.47  81.32 
Next, we illustrate the top-K precision curves from 32 bits to 128 bits on the two datasets. Figure 2(a) plots the top-K precision on MIR Flickr, and Figure 3(a) plots that on NUS-WIDE. As the figures show, TAADCMH obtains higher precision than the other baselines on both the I2T and T2I sub-retrieval tasks at all code lengths. Moreover, we observe that the top-K precision curves of TAADCMH remain relatively stable as the number of retrieved samples increases. These observations show that TAADCMH has a strong ability to retrieve relevant samples effectively. Compared with the I2T task, the top-K precision of TAADCMH exceeds the baseline methods by a larger margin on the T2I cross-modal retrieval task, which is consistent with the mAP values in Tables 2 and 3. In practical retrieval, users browse websites according to the ranking of the retrieval results, so they are interested in the top-ranked instances in the retrieved list. Thus, TAADCMH significantly outperforms the comparison methods on both sub-retrieval tasks.
To summarize, TAADCMH achieves superior performance on MIR Flickr and NUS-WIDE. This validates that capturing the query semantics of the different cross-modal retrieval tasks is effective when learning cross-modal hash codes. All the results confirm the effectiveness of our designed loss functions and optimization scheme.
Table 4: mAP scores (%) of TAADCMH and its variants on MIR Flickr.

Methods  I2T  T2I  
16 bits  32 bits  64 bits  128 bits  16 bits  32 bits  64 bits  128 bits  
TAADCMH-I  88.28  88.76  89.57  91.03  90.45  91.31  91.45  92.08 
TAADCMH-II  83.12  84.63  85.51  86.47  86.27  87.01  87.84  89.25 
TAADCMH-III  88.52  89.47  89.75  90.77  89.68  91.48  91.76  92.03 
TAADCMH  88.52  89.47  90.79  91.30  91.36  92.91  93.30  93.87 
Table 5: mAP scores (%) of TAADCMH and its variants on NUS-WIDE.

Methods  I2T  T2I  
16 bits  32 bits  64 bits  128 bits  16 bits  32 bits  64 bits  128 bits  
TAADCMH-I  75.46  76.58  78.11  79.17  79.43  79.97  80.86  81.16 
TAADCMH-II  68.36  69.01  72.35  74.28  68.79  71.07  73.18  74.56 
TAADCMH-III  69.36  75.24  76.90  77.47  78.52  79.12  80.43  81.13 
TAADCMH  75.52  78.92  79.13  79.39  79.68  80.01  80.47  81.32 
5.2 Effects of task-adaptive hash function learning
Our method is designed to learn taskadaptive hash functions by additionally regressing the query modality representation to the class label. With the further semantic supervision, the queryspecific modality representation can effectively capture the query semantics of different crossmodal retrieval tasks. To verify the effects of this part, we design two variant methods TAADCMHI and TAADCMHII for performance comparison. 1) TAADCMHI directly performs semantic regression from class label to the shared hash codes instead of the queryspecific modality representation. Mathematically, the optimization objective function of TAADCMHI becomes
The binary codes are calculated accordingly, with several balance parameters trading off the terms. L is the class label matrix, and V is the projection matrix that regresses the hash codes B to the semantic labels L. An additional term is employed to balance each bit of the hash codes over all training points. 2) The other variant, TA-ADCMH-II, performs only pairwise semantic supervision, without employing explicit label regression. Mathematically, the optimization objective function of TA-ADCMH-II becomes
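The label-regression part of this variant has a standard closed-form solution when it is written as a ridge-regularized least-squares problem. The exact equation was lost in extraction, so the objective below (min over V of ||L - VB||² + λ||V||²) is an assumption standing in for it:

```python
import numpy as np

def label_regression_matrix(B, L, lam=1.0):
    """Closed-form ridge solution of min_V ||L - V B||_F^2 + lam * ||V||_F^2.

    B: (b, n) hash code matrix, L: (c, n) class label matrix.
    NOTE: this objective is an assumed stand-in for the paper's (lost)
    equation; only the regression-to-label structure comes from the text.
    """
    b = B.shape[0]
    # V = L B^T (B B^T + lam I)^{-1}, the normal-equations solution
    return L @ B.T @ np.linalg.inv(B @ B.T + lam * np.eye(b))
```

With a small λ and full-rank codes, V B recovers L, which is the sense in which the codes are "regressed to the semantic labels".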
The binary codes are computed analogously, again with balance parameters trading off the terms, where the objective involves the deep image features and the deep text features. The performance of the two variant methods is reported in Tables 4 and 5 for both sub-retrieval tasks. Both tables show that our method outperforms the variants TA-ADCMH-I and TA-ADCMH-II on the two datasets at all code lengths. These results prove that task-adaptive hash function learning is effective in improving cross-modal retrieval performance.
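A pairwise supervision term of this kind typically regresses cross-modal inner products to a scaled similarity matrix. Since the paper's exact loss was lost in extraction, the quantization-style form below (with scale r equal to the code length) is an assumption, not the paper's formula:

```python
import numpy as np

def pairwise_similarity_loss(F, G, S, scale=None):
    """Squared-error pairwise loss ||F^T G - r * S||_F^2 / n^2.

    F: (b, n) deep image representations, G: (b, n) deep text
    representations, S: (n, n) +/-1 cross-modal similarity matrix.
    ASSUMPTION: the inner-product-to-similarity form and the default
    scale r = b are common stand-ins, not taken from the paper.
    """
    r = F.shape[0] if scale is None else scale
    resid = F.T @ G - r * S
    return float((resid ** 2).sum() / S.size)
```

The loss is zero exactly when every cross-modal inner product matches its scaled similarity, i.e. when semantically similar image-text pairs have aligned representations.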
5.3 Effects of discrete optimization
We devise a variant method TA-ADCMH-III to validate the effect of discrete optimization. Specifically, TA-ADCMH-III uses a relaxation strategy: it first replaces the discrete constraints with continuous ones and then binarizes the real-valued solution into hash codes by thresholding. In Eq. (1) and Eq. (4), we directly discard the discrete constraints, so the relaxed hash codes can be calculated in closed form for the I2T and T2I tasks, respectively. Tables 4 and 5 show the comparison between TA-ADCMH-III and TA-ADCMH on MIR Flickr and NUS-WIDE, respectively. The results demonstrate that TA-ADCMH achieves superior performance over TA-ADCMH-III, which further validates that the quantization errors are reduced by the discrete optimization.

5.4 Parameter experiments
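The thresholding step of the relaxed variant, and the quantization error it leaves behind, can be sketched as follows (a minimal illustration of the relax-then-round strategy, not the paper's implementation):

```python
import numpy as np

def binarize_by_thresholding(H):
    """Relaxation post-step as in the TA-ADCMH-III variant: take a
    real-valued solution H and threshold it into +/-1 hash codes.

    Values at exactly 0 are mapped to +1 so every bit is well defined.
    """
    B = np.where(H >= 0, 1, -1)
    # Quantization error the relaxation accumulates; a discrete
    # optimization scheme keeps codes binary and avoids this gap.
    quant_err = float(np.linalg.norm(H - B) ** 2)
    return B, quant_err
```

The gap ||H - B||² is exactly what the comparison between TA-ADCMH and TA-ADCMH-III attributes the performance difference to.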
The empirical analysis of parameter sensitivity is conducted on MIR Flickr with 32 bits. Specifically, we report the mAP performance with the eight involved parameters for the two retrieval tasks. For the I2T retrieval task there are four involved parameters, and we tune each one over its candidate range while fixing the others. As Figure 4 (a) and (d) show, the performance of the first two parameters is relatively stable over a wide range of values, but the mAP decreases sharply once they grow beyond it. As shown in Figure 4 (b) and (c), the I2T task is indeed influenced by the other two parameters: the mAP is relatively stable while each increases within a moderate range, degrades quickly beyond it, and hence each should be chosen within its stable range. The remaining four parameters are involved in the T2I retrieval task, and we tune each of them in the same one-at-a-time manner. The detailed results are shown in Figure 5. From Figure 5 (a), (c), and (d), we conclude that the mAP of the T2I task stays in a steady trend over the corresponding parameter ranges. From Figure 5 (b), the best T2I performance is obtained within a moderate range of the remaining parameter, and the performance decreases once it grows larger. In general, the parameters of TA-ADCMH matter to the results, yet the performance remains stable within a reasonable range of values.
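The one-at-a-time protocol described above (vary one hyperparameter over its grid while the others stay fixed, recording mAP) can be sketched generically; the parameter names and the `train_and_eval` callable here are hypothetical placeholders, not the paper's code:

```python
def sensitivity_sweep(train_and_eval, defaults, grids):
    """One-at-a-time hyperparameter sweep.

    train_and_eval: callable(params_dict) -> mAP (user supplied).
    defaults: dict of baseline parameter values.
    grids: dict mapping a parameter name to the values to try for it.
    Returns, per parameter, the (value, mAP) curve with all other
    parameters held at their defaults.
    """
    results = {}
    for name, grid in grids.items():
        curve = []
        for value in grid:
            params = dict(defaults)   # copy, then vary a single parameter
            params[name] = value
            curve.append((value, train_and_eval(params)))
        results[name] = curve
    return results
```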
5.5 Convergence analysis
To analyze the convergence of TA-ADCMH, we display the convergence curves with 32 bits on MIR Flickr and NUS-WIDE in Figure 6. For the I2T retrieval task, the convergence results of Eq. (1) are recorded in Figure 6 (a) and (c); for the T2I retrieval task, the convergence results of Eq. (4) are recorded in Figure 6 (b) and (d). In all subfigures, the abscissa is the iteration number and the ordinate is the value of the objective function. As the figures show, TA-ADCMH reaches a stable minimum within 300 iterations for the I2T task and within 400 iterations for the T2I task on the MIR Flickr dataset, and within 300 iterations for both retrieval tasks on the NUS-WIDE dataset. These results confirm that TA-ADCMH converges gradually.
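A convergence curve like those in Figure 6 comes from an alternating-optimization driver that records the objective each iteration and stops once the decrease levels off. The stopping rule below is a generic sketch of that idea, not the paper's exact criterion:

```python
def run_until_converged(step, max_iters=500, tol=1e-6):
    """Generic alternating-optimization driver.

    step: callable performing one round of variable updates and
    returning the current objective value.  Iterates until the
    relative decrease of the objective falls below tol (an assumed
    criterion) or max_iters is reached; returns the recorded curve.
    """
    history = []
    prev = float("inf")
    for _ in range(max_iters):
        obj = step()
        history.append(obj)
        if prev != float("inf") and abs(prev - obj) <= tol * max(abs(prev), 1.0):
            break
        prev = obj
    return history
```

Plotting `history` against its index reproduces the iteration-vs-objective curves used in this analysis.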
6 Conclusion
In this work, we propose a Task-adaptive Asymmetric Deep Cross-modal Hashing (TA-ADCMH) method that learns task-adaptive hash functions for different cross-modal retrieval tasks. The deep learning framework jointly optimizes semantic preservation from the multi-modal deep representations to the hash codes and semantic regression from the query-specific representation to the explicit labels. The learned hash codes can effectively preserve multi-modal semantic correlations while adaptively capturing the query semantics. Further, we devise a discrete optimization scheme to effectively solve the binary constraints on the hash codes. Experiments on two datasets demonstrate the superiority of our TA-ADCMH method.
References
- Large-scale machine learning with stochastic gradient descent. In Proceedings of the International Conference on Computational Statistics (COMPSTAT), pp. 177–186.
- Cross-modal retrieval in the cooking context: learning semantic text-image embeddings. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR), pp. 35–44.
- NUS-WIDE: a real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR), pp. 48.
- ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255.
- Collective matrix factorization hashing for multimodal data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2075–2082.
- Deep binary reconstruction for cross-modal hashing. IEEE Transactions on Multimedia 21(4), pp. 973–985.
- Collective reconstructive embeddings for cross-modal hashing. IEEE Transactions on Image Processing 28(6), pp. 2770–2784.
- The MIR Flickr retrieval evaluation. In Proceedings of the ACM International Conference on Multimedia Information Retrieval (MIR), pp. 39–43.
- Deep cross-modal hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3232–3240.
- Discrete latent factor model for cross-modal hashing. IEEE Transactions on Image Processing 28(7), pp. 3490–3501.
- Learning hash functions for cross-view similarity search. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1360–1365.
- Self-supervised adversarial hashing networks for cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4242–4251.
- Semantics-preserving hashing for cross-view retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3864–3872.
- Cross-modal discrete hashing. Pattern Recognition 79, pp. 114–129.
- Fast discrete cross-modal hashing with regressing from semantic labels. In Proceedings of the ACM International Conference on Multimedia (MM), pp. 1662–1669.
- Efficient discrete latent semantic hashing for scalable cross-modal retrieval. Signal Processing 154, pp. 217–231.
- Adversarial cross-modal retrieval based on dictionary learning. Neurocomputing 355, pp. 93–104.
- Supervised discrete hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 37–45.
- Inter-media hashing for large-scale retrieval from heterogeneous data sources. In Proceedings of the ACM Conference on Management of Data (SIGMOD), pp. 785–796.
- Polysemous visual-semantic embedding for cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1979–1988.
- Supervised matrix factorization hashing for cross-modal retrieval. IEEE Transactions on Image Processing 25(7), pp. 3157–3166.
- Adversarial cross-modal retrieval. In Proceedings of the ACM International Conference on Multimedia (MM), pp. 154–162.
- Label consistent matrix factorization hashing for large-scale cross-modal similarity search. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(10), pp. 2466–2479.
- Semantic topic multimodal hashing for cross-media retrieval. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 3890–3896.
- Robust and flexible discrete hashing for cross-modal similarity search. IEEE Transactions on Circuits and Systems for Video Technology 28(10), pp. 2703–2715.
- Joint feature selection and graph regularization for modality-dependent cross-modal retrieval. Journal of Visual Communication and Image Representation 54, pp. 213–222.
- Task-dependent and query-dependent subspace learning for cross-modal retrieval. IEEE Access 6, pp. 27091–27102.
- Fusion-supervised deep cross-modal hashing. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pp. 37–42.
- Unsupervised deep cross-modal hashing with virtual label regression. Neurocomputing.
- Spectral hashing. In Advances in Neural Information Processing Systems (NIPS), pp. 1753–1760.
- Online cross-modal hashing for web image retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 294–300.
- Unsupervised multi-graph cross-modal hashing for large-scale multimedia retrieval. Multimedia Tools and Applications 75(15), pp. 9185–9204.
- Cross-modal self-taught hashing for large-scale image retrieval. Signal Processing 124, pp. 81–92.
- Learning discriminative binary codes for large-scale cross-modal retrieval. IEEE Transactions on Image Processing 26(5), pp. 2494–2507.
- Dictionary learning based hashing for cross-modal retrieval. In Proceedings of the ACM International Conference on Multimedia (MM), pp. 177–181.
- Learning unified binary codes for cross-modal retrieval via latent semantic hashing. Neurocomputing 213, pp. 191–203.
- Robust discrete spectral hashing for large-scale image semantic indexing. IEEE Transactions on Big Data 1(4), pp. 162–171.
- Adaptive semi-supervised feature selection for cross-modal retrieval. IEEE Transactions on Multimedia 21(5), pp. 1276–1288.
- Large-scale supervised multimodal hashing with semantic correlation maximization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 2177–2183.
- Unsupervised generative adversarial cross-modal hashing. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 539–546.
- Deep discrete cross-modal hashing for cross-media retrieval. Pattern Recognition 83, pp. 64–77.
- Latent semantic sparse hashing for cross-modal similarity search. In Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR), pp. 415–424.
- Linear cross-modal hashing for efficient multimedia search. In Proceedings of the ACM International Conference on Multimedia (MM), pp. 143–152.