With the rapid development of the Internet, different types of media data are growing quickly, e.g., texts, images and videos. These different types of data may describe the same events or topics; for example, Flickr allows users to post interactive comments on photos. Hence, a retrieval model for multi-modal data is highly desirable. Cross-modal retrieval, which takes one type of data as the query and returns relevant data of another type, is receiving increasing attention since it is a natural way of searching multi-modal data. Existing methods can be roughly divided into two categories: real-valued representation learning and binary representation learning. Because of the low storage cost and fast retrieval speed of binary representations, we focus on cross-modal binary representation learning (i.e., hashing) in this paper.
Many cross-modal hashing methods have been proposed for embedding the correlations among different modalities of data. In the cross-modal hashing procedure, feature extraction is the first step for representing each modality, after which the multi-modal features are projected into a common Hamming space for subsequent search. Many methods [4, 3] use shallow architectures for feature extraction; for example, collective matrix factorization hashing (CMFH) and semantic correlation maximization (SCM) use hand-crafted features. Recently, deep learning has also been adopted for cross-modal hashing due to its powerful ability to learn good representations of data. Representative deep-network-based cross-modal hashing methods include deep cross-modal hashing (DCMH), deep visual-semantic hashing (DVSH), pairwise relationship guided deep hashing (PRDH), and so on.
In parallel, the computational model of "attention" has drawn much interest due to its impressive results in various applications, e.g., image captioning. It is also desirable for the cross-modal retrieval problem. For example, as shown in Figure 1, given the query "girl sits on donkey", higher accuracy can be obtained if we can locate the more informative objects in the image (e.g., the black regions). To the best of our knowledge, the attention mechanism has not been well explored for cross-modal hashing.
In this paper, we propose an adversarial hashing network with an attention mechanism for cross-modal hashing. Ideally, good attention masks should locate discriminative regions, which also means that the unattended regions of the data are uninformative and hard to use for preserving similarities. Hence, in our proposed network, adaptive attention masks are generated for the multi-modal data, and the learned masks divide the data into attended samples (which keep only the foregrounds of the data) and unattended samples (which keep only the backgrounds). Hinging on such attention masks, a good discriminative hashing model should preserve the similarities of both the foreground samples (which can be viewed as easy examples) and the background samples (hard examples), enhancing the robustness and performance of the learned hash functions. Conversely, a good generator should produce attention masks such that the discriminator cannot preserve the similarities of the background samples, since the unattended regions of the data should not be discriminative.
Based on this, we present a new adversarial model called HashGAN, which is illustrated in Figure 2 and consists of three major components: (1) a feature learning module which uses a CNN or an MLP to extract high-level semantic representations of the multi-modal data, (2) a generative attention module which generates the adaptive attention masks and divides the feature representations into attended and unattended feature representations, and (3) a discriminative hashing module which focuses on learning the binary codes for the multi-modal data. HashGAN trains two adversarial networks alternately: the discriminator learns to preserve the similarities of both the easy foreground feature representations and the hard background feature representations, while the generator learns to produce masks that make the discriminator fail to keep the similarities of the background feature representations. An adversarial retrieval loss and a cross-modal retrieval loss are proposed to obtain good attention masks and powerful hash functions.
The main contributions of our work are three-fold. First, we propose an attention-aware method for the cross-modal hashing problem, which is able to detect the informative regions of multi-modal data. Second, we propose HashGAN for learning effective attention masks and compact binary codes simultaneously. Third, we quantitatively evaluate the usefulness of attention in cross-modal hashing, and our method yields better performance compared with several state-of-the-art methods.
2 Related Work
2.1 Cross-modal Hashing
According to the information used for learning the common representations, cross-modal hashing methods can be categorized into three groups: 1) unsupervised methods, 2) pairwise-based methods [11, 2] and 3) supervised methods [12, 13]. Unsupervised methods use only co-occurrence information to learn hash functions for multi-modal data. For instance, cross-view hashing (CVH) extends spectral hashing from the uni-modal to the multi-modal scenario. Pairwise-based methods use both co-occurrence information and similar/dissimilar pairs to learn the hash functions. Bronstein et al. proposed cross-modal similarity sensitive hashing (CMSSH), which learns hash functions ensuring that if two samples (from different modalities) are relevant/irrelevant, their corresponding binary codes are similar/dissimilar. Supervised methods exploit label information to learn more discriminative common representations. Semantic correlation maximization (SCM) uses label vectors to obtain the similarity matrix and reconstructs it through the binary codes.
However, most of these works are based on hand-crafted features. Recently, deep learning methods have shown that they can effectively discover the correlations across different modalities. The most representative work is deep cross-modal hashing (DCMH), which integrates feature learning and hash-code learning into the same framework. Cao et al. proposed deep visual-semantic hashing (DVSH), which utilizes a convolutional neural network (CNN) and a long short-term memory (LSTM) network to separately learn the common representations for each modality. Pairwise relationship guided deep hashing (PRDH) also adopts deep CNN models to learn feature representations and hash codes simultaneously.
However, all these methods encode an entire data point into a binary representation. Few works have attempted to introduce an attention mechanism into cross-modal hashing.
2.2 Attention Models
Attention-aware methods capture where a model should focus when performing a particular task. The attention mechanism has proven to be very powerful in many applications, such as image classification, image captioning, image question answering, video action recognition, etc. For example, Xu et al. proposed two forms of attention for image captioning: a "hard" attention mechanism trained by REINFORCE and a "soft" attention mechanism trained by standard back-propagation. Stacked attention networks (SANs) take multiple steps to progressively focus attention on the relevant regions, leading to better answers for image QA. Sharma et al. proposed a soft-attention-based model for action recognition, which uses recurrent neural networks (RNNs) with long short-term memory (LSTM) units to capture both spatial and temporal information.
2.3 Generative Adversarial Network
Generative adversarial networks (GANs) have received a lot of interest in generative modelling. The original GAN trains two models: a discriminative model and a generative model. The discriminative model learns to determine whether a sample comes from the model distribution or the data distribution, while the generative model attempts to produce samples that fool the discriminative model.
Recently, several approaches have been proposed to improve the original GAN, e.g., DCGAN, CGAN and Wasserstein GAN. IRGAN is a recently proposed method for information retrieval, in which the generative retrieval model focuses on predicting relevant documents and the discriminative retrieval model focuses on predicting relevancy given a query-document pair. Different from our method, IRGAN is designed for uni-modal retrieval and is not attention-aware.
In this paper, we extend GANs to cross-modal hashing. We carefully design a new GAN, called HashGAN, to generate attention-aware common representations and to learn similarity-preserving hash functions.
3.1 Problem Definition
Suppose there are $n$ training samples, each of which is represented in several modalities, e.g., audio, video, image and text. In this paper, we focus on two modalities: text and image. Note that our method can be easily extended to other modalities.
We denote the training data as $\{(X_i, Y_i)\}_{i=1}^{n}$, where $X_i$ is the $i$-th image and $Y_i$ is the corresponding text description of image $X_i$. We also have a cross-modal similarity matrix $S$, where $S_{ij} = 1$ means the $i$-th image and the $j$-th text are similar, and $S_{ij} = 0$ means they are dissimilar.
The goal of cross-modal hashing is to learn two mapping functions that transform images and texts, respectively, into a common binary code space in which the similarities between paired images and texts are preserved. Formally, let $B^x_i$ and $B^y_j$ denote the generated $k$-bit binary codes for the $i$-th image and the $j$-th text, respectively. If $S_{ij} = 1$, the $i$-th image and the $j$-th text are similar, and hence the Hamming distance between $B^x_i$ and $B^y_j$ should be small; when $S_{ij} = 0$, the Hamming distance between them should be large.
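As a toy illustration of this similarity-preserving goal (a sketch, not the paper's code; the codes below are made up), the Hamming distance between $\{-1, +1\}$ codes can be computed as follows:

```python
import numpy as np

def hamming_distance(b1, b2):
    """Number of differing bits between two code vectors."""
    return int(np.sum(np.asarray(b1) != np.asarray(b2)))

b_text = np.array([1, -1, 1, 1, -1, -1, 1, -1])       # 8-bit code of a text query
b_img_sim = np.array([1, -1, 1, 1, -1, 1, 1, -1])     # code of a similar image (S_ij = 1)
b_img_dis = np.array([-1, 1, -1, -1, 1, 1, -1, 1])    # code of a dissimilar image (S_ij = 0)

# Similarity preservation: the similar pair should be closer in Hamming space.
assert hamming_distance(b_text, b_img_sim) < hamming_distance(b_text, b_img_dis)
```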
3.2 Network Architecture
We propose HashGAN for the cross-modal hashing problem, which contains three types of networks: 1) feature learning networks for obtaining high-level representations of the multi-modal data, 2) a generative attention network for learning the attention distributions, and 3) discriminative hashing networks for learning the binary codes for cross-modal retrieval.
3.2.1 Feature Learning Components
For the image modality, a convolutional neural network is used to obtain high-level representations of images. In this paper, we use VGGNet as the basic network to generate the feature maps, as shown in Figure 3. Let $F_i$ denote the feature maps of the $i$-th raw image.
For the text modality, we use a multi-layer perceptron (MLP) to obtain a powerful semantic representation of texts. Following DCMH, we use bag-of-words (BOW) vectors as the feature representation for the text modality. There are two fully-connected layers, as shown in Figure 3. We denote by $T_j$ the feature vector of the $j$-th text.
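For concreteness, a minimal binary bag-of-words encoding of a text description looks like the sketch below (the vocabulary here is a hypothetical toy, not the one used in the experiments):

```python
import numpy as np

def bow_vector(text, vocabulary):
    """Binary bag-of-words: 1 if a vocabulary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return np.array([1.0 if w in words else 0.0 for w in vocabulary])

vocab = ["girl", "donkey", "beach", "sits", "mountain"]  # toy vocabulary
vec = bow_vector("A girl sits on a donkey", vocab)
# This fixed-length vector is what the two-layer MLP takes as input.
```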
3.2.2 Generative Attention Components
Given the image feature maps $F_i$ and the text feature vector $T_j$, we first feed each into a one-layer network, i.e., a convolutional layer for the image feature maps and a fully-connected layer for the text feature vector, followed by a softmax function and a threshold function to generate the attention distribution over the regions of the multi-modal data.
More specifically, Figure 4 shows the pipeline for the image modality in detail. Suppose $F_i \in \mathbb{R}^{H \times W \times C}$ is the feature maps of the $i$-th image, where $H$, $W$ and $C$ are the height, width and number of channels of the feature maps, respectively. In the first step, the feature maps are mapped into a single-channel mask $M_i$ by a convolutional layer. Next, the mask goes through a softmax layer whose output, denoted $P_i$, is defined as
$$P_i^{(u,v)} = \frac{\exp\bigl(M_i^{(u,v)}\bigr)}{\sum_{u'=1}^{H} \sum_{v'=1}^{W} \exp\bigl(M_i^{(u',v')}\bigr)},$$
where $M_i^{(u,v)}$ and $P_i^{(u,v)}$ denote the values in the $u$-th row and $v$-th column of the matrices $M_i$ and $P_i$, respectively. The elements of $P_i$ form a probability distribution, with $P_i^{(u,v)} \geq 0$ and $\sum_{u,v} P_i^{(u,v)} = 1$.
Larger values in $P_i$ correspond to foreground regions, while background regions tend to have smaller responses. Thus, in the third step, we add a threshold layer that divides the data into attended and unattended regions, defined as
$$\hat{M}_i^{(u,v)} = \begin{cases} 1, & P_i^{(u,v)} > \beta, \\ 0, & \text{otherwise}, \end{cases}$$
where $\beta$ is a predefined threshold (fixed in our experiments). The output of the threshold layer is a binary mask whose elements are either 0 or 1. Regions with value 1 are regarded as the foreground, i.e., the regions that are attended to, while the other regions are regarded as background.
Based on this attention distribution, we compute the attention-aware and inattention-aware feature maps of the $i$-th image by applying the binary mask element-wise:
$$F_i^{fg\,(u,v)} = \hat{M}_i^{(u,v)} \, F_i^{(u,v)}, \qquad F_i^{bg\,(u,v)} = \bigl(1 - \hat{M}_i^{(u,v)}\bigr) \, F_i^{(u,v)},$$
for all $u \in \{1, \dots, H\}$ and $v \in \{1, \dots, W\}$. The foreground is $F_i^{fg}$ and the background is $F_i^{bg}$. For ease of presentation, we denote the whole procedure by $G_x$.
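The spatial softmax, threshold and element-wise split described above can be sketched in NumPy as follows (a simplified illustration: the per-location score here is a stand-in for the learned convolutional output, and the threshold value and tensor shapes are illustrative, not the paper's settings):

```python
import numpy as np

def attention_split(feature_maps, beta):
    """Split H x W x C feature maps into attended (foreground) and
    unattended (background) parts via a softmax-then-threshold mask."""
    H, W, C = feature_maps.shape
    scores = feature_maps.mean(axis=2)              # stand-in for the 1-channel conv output
    p = np.exp(scores - scores.max())
    p /= p.sum()                                    # softmax over all H*W locations
    mask = (p > beta).astype(feature_maps.dtype)    # binary attention mask
    fg = feature_maps * mask[:, :, None]            # attention-aware features
    bg = feature_maps * (1.0 - mask)[:, :, None]    # inattention-aware features
    return fg, bg, mask

rng = np.random.default_rng(0)
F = rng.standard_normal((7, 7, 4))
fg, bg, mask = attention_split(F, beta=1.0 / (7 * 7))
# Every feature value ends up in exactly one of the two parts.
assert np.allclose(fg + bg, F)
```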
For the text modality, we follow a pipeline similar to that of the image modality, as shown in Figure 5. Since we have feature vectors rather than feature maps, we use a fully-connected layer instead of the convolutional layer, again followed by the softmax and threshold layers. Formally, we compute
$$\hat{m}_j = \mathrm{threshold}\bigl(\mathrm{softmax}(W T_j + b)\bigr),$$
where $W$ and $b$ are the two parameters of the fully-connected layer, and the resulting mask is applied to the feature vector via the Kronecker product. We denote by $T_j^{fg}$ and $T_j^{bg}$ the attention-aware and inattention-aware features of the $j$-th text, and the whole procedure by $G_y$.
However, the derivative of the threshold function is incompatible with back-propagation during training. Specifically, suppose $L$ is the loss function; we need the gradient $\partial L / \partial M_i$ to update the network parameters by stochastic gradient descent (SGD) during back-propagation. However, the derivative $\partial \hat{M}_i / \partial P_i$ of the threshold layer is almost zero everywhere according to the definition of the threshold function. Then, by the chain rule
$$\frac{\partial L}{\partial M_i} = \frac{\partial L}{\partial \hat{M}_i} \cdot \frac{\partial \hat{M}_i}{\partial P_i} \cdot \frac{\partial P_i}{\partial M_i},$$
we can see that $\partial L / \partial M_i$ is also nearly zero. Consequently, such an almost-zero-valued node can block the back-propagation process.
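The blocking effect can be verified numerically: a small perturbation of the pre-threshold values almost never changes the binary mask, so the finite-difference gradient of the threshold layer is zero (the values below are made-up illustrations):

```python
import numpy as np

def threshold(p, beta=0.25):
    """Hard threshold: 1 where p > beta, else 0."""
    return (p > beta).astype(float)

p = np.array([0.1, 0.3, 0.2, 0.4])
eps = 1e-6
# Finite-difference derivative of the threshold layer w.r.t. each input.
grad = (threshold(p + eps) - threshold(p)) / eps
assert np.all(grad == 0.0)  # zero almost everywhere -> back-propagation is blocked
```

One common workaround in the literature (e.g., BinaryNet-style training) is a straight-through estimator that passes the gradient through the hard threshold unchanged in the backward pass; we note this only as a standard remedy, not necessarily the exact one used here.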
3.2.3 Discriminative Hashing Components
The discriminator networks encode the high-level features for two modalities into binary codes.
Since we adopt VGGNet as our basic architecture, we simply use its last fully-connected layers, e.g., fc6 and fc7 (the last fully-connected layer, fc8, is removed since it is specific to the classification problem), to encode the images. We then add a fully-connected layer with $k$-dimensional output, followed by a tanh layer that restricts the values to the range $(-1, 1)$. Let $B^{x,fg}_i$ and $B^{x,bg}_i$ denote the outputs of the image discriminator network, i.e., the codes for the $i$-th attention-aware and inattention-aware feature maps, respectively.
For the text modality, we similarly add a fully-connected layer and a tanh layer to encode the text features into $k$ bits. As with the image discriminator, $B^{y,fg}_j$ and $B^{y,bg}_j$ denote the codes for the attention-aware and inattention-aware features of the $j$-th text, respectively.
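The final encoding step, a fully-connected layer with tanh to squash activations into $(-1, 1)$ followed by a sign operation to obtain bits, can be sketched as follows (the layer sizes and random weights are illustrative assumptions, not the trained network):

```python
import numpy as np

def encode_to_bits(features, W, b):
    """Fully-connected layer + tanh, then sign to get k-bit {-1, +1} codes."""
    h = np.tanh(features @ W + b)       # values restricted to (-1, 1)
    return np.where(h >= 0, 1, -1)      # binarize to obtain the hash code

rng = np.random.default_rng(1)
feat = rng.standard_normal(16)          # a high-level feature vector (toy size)
W = rng.standard_normal((16, 8))        # k = 8 bits in this sketch
b = np.zeros(8)
code = encode_to_bits(feat, W, b)
assert set(np.unique(code).tolist()) <= {-1, 1}
```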
3.3 Hashing Objectives
Our objective contains two types of terms: 1) a cross-modal retrieval loss, which learns to preserve the similarities between different modalities of data, and 2) an adversarial retrieval loss, which drives the generation of the attention distribution.
3.3.1 Cross-modal Retrieval Loss
The aim of the cross-modal loss function is to preserve the similarities between images and texts. To this end, both an inter-modal ranking loss and an intra-modal ranking loss are used: hash codes from different modalities should preserve semantic similarity, and hash codes from the same modality should preserve it as well.
The cross-modal retrieval loss can be formulated as
$$L_{cm} = J_{t \to i} + J_{i \to t} + J_{i \to i} + J_{t \to t},$$
where $J_{a \to b}$ denotes the similarity-preserving loss when modality $a$ is taken as the query to retrieve relevant data from modality $b$, with $a, b \in \{i, t\}$ ($i$ for image, $t$ for text). For example, $J_{t \to i}$ means that text queries are used to retrieve relevant images. The first two terms preserve the semantic similarity between different modalities, and the last two terms preserve the similarity within each modality.
We take $J_{t \to i}$ as an example for illustration. Given the binary code of the $j$-th text, good hash functions should rank similar images ahead of dissimilar images; that is, in Hamming space we should have $d_H(B^y_j, B^x_k) < d_H(B^y_j, B^x_l)$ whenever $S_{jk} = 1$ and $S_{jl} = 0$. Formally, $J_{t \to i}$ can be defined as the triplet ranking loss
$$J_{t \to i} = \sum_{(j,k,l)} \max\Bigl(0,\; \alpha + d_H(B^y_j, B^x_k) - d_H(B^y_j, B^x_l)\Bigr),$$
where $(j, k, l)$ ranges over triplets with $S_{jk} = 1$ and $S_{jl} = 0$, $d_H(\cdot,\cdot)$ is the Hamming distance, and $\alpha$ is a margin. The triplet ranking loss has shown its effectiveness in uni-modal retrieval.
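A minimal sketch of a hinge-style triplet ranking loss over Hamming distances follows (the margin and toy codes are illustrative assumptions):

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two {-1, +1} code vectors."""
    return int(np.sum(a != b))

def triplet_ranking_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the Hamming-distance gap: the similar item should rank ahead."""
    return max(0.0, margin + hamming(anchor, positive) - hamming(anchor, negative))

q = np.array([1, -1, 1, -1])       # text query code
pos = np.array([1, -1, 1, 1])      # code of a similar image (distance 1)
neg = np.array([-1, 1, -1, 1])     # code of a dissimilar image (distance 4)
loss = triplet_ranking_loss(q, pos, neg)
# The ranking constraint is already satisfied with a comfortable gap, so loss = 0.
```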
Similarly, $J_{i \to t}$ is defined by exchanging the roles of images and texts, and the intra-modal terms $J_{i \to i}$ and $J_{t \to t}$ are defined analogously within the image and text modality, respectively.
3.3.2 Adversarial Retrieval Loss
Inspired by the impressive results of generative adversarial networks (GANs) in image generation, we adopt this framework to generate the attention distribution. Like a GAN, our method has two models: the generative attention model $G$ and the discriminative hashing model $D$. $D$ aims to preserve the semantic similarities between different modalities, while $G$ tries to generate the attention distribution as described in Subsection 3.2.2; the inattention-aware features produced by $G$ should make $D$ fail to preserve the semantic similarities. Hence, the adversarial loss can be expressed as
$$L_{adv} = L_{cm}\bigl(B^{x,bg}, B^{y,bg}\bigr),$$
i.e., the similarity-preserving loss evaluated on the codes $B^{x,bg}_i$ and $B^{y,bg}_j$ of the generated inattention-aware features. $G$ tries to maximize this loss, while $D$ tries to minimize it.
3.3.3 Full Objective
Our full objective combines the two losses:
$$L = L_{cm} + L_{adv}.$$
Similar to GAN training, we train the model alternately. First, the parameters of the generative attention model $G$ are fixed and the remaining parameters are trained:
$$\min_{D} \; L_{cm} + L_{adv}.$$
Then the discriminator parameters are fixed and the generative attention model is updated:
$$\max_{G} \; L_{adv}.$$
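The alternating scheme can be sketched as a training loop (a structural sketch only: the update steps are placeholders, and the 4-to-1 discriminator-to-generator ratio mirrors the setting reported in the experiments, which is an assumption about which model receives the 4 steps):

```python
def train_hashgan(num_iters, d_steps=4, g_steps=1):
    """Alternating optimization sketch: the discriminator D minimizes the
    cross-modal + adversarial losses; the generator G maximizes the
    adversarial loss. The update bodies below are placeholders."""
    log = []
    for _ in range(num_iters):
        for _ in range(d_steps):
            log.append("D")   # SGD step on D (G's parameters frozen)
        for _ in range(g_steps):
            log.append("G")   # SGD step on G (D's parameters frozen)
    return log

schedule = train_hashgan(num_iters=2)
# Two outer iterations, each with 4 D-steps followed by 1 G-step.
assert schedule == ["D", "D", "D", "D", "G"] * 2
```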
4 Experiments

In this section, we evaluate the performance of our proposed method on three datasets and compare it with several state-of-the-art algorithms.

4.1 Datasets
IAPR TC-12 is a popular dataset for cross-modal retrieval. It consists of 20,000 natural still images collected from a wide range of domains, with at least one sentence description per image. The image-text pairs are multi-label, with 255 concept categories serving as the ground-truth labels. In our experiments, we use the whole dataset. For the image modality, we use the raw pixels directly; for each text sample, we convert the sentence descriptions into 2912-dimensional bag-of-words vectors.
MIR-Flickr 25k includes 25,000 multi-label images downloaded from the photo-sharing website Flickr.com. The textual description of each image consists of several words, and each instance holds one or more labels among 24 concept categories. In our experiments, we first remove the textual words that occur fewer than 20 times, and then delete the image-text pairs lacking textual words or labels from the original dataset, leaving 20,015 instances. For the image modality, we use raw pixels as before, while 1386-dimensional bag-of-words vectors represent the text points.
NUS-WIDE is a widely used dataset for cross-modal retrieval consisting of 269,648 multi-label images. As with MIR-Flickr, the textual representation of each image is a set of associated words. There are 81 concept categories provided for evaluation. In our experiments, we choose the image-text pairs that belong to the 21 most frequent labels and the 1,000 most frequent textual words, which yields 195,834 pairs. For the image modality, we again use raw pixels, and 1000-dimensional bag-of-words vectors are used for the text modality.
To construct the training and test sets, we randomly choose 2,000 image-text pairs from each of the IAPR TC-12 and MIR-Flickr datasets as test (i.e., query) sets; the remaining instances form the retrieval sets, and 10,000 random samples from each retrieval set are used as the training sets. For the NUS-WIDE dataset, we select 2,100 image-text pairs as the test (query) set; the rest constitutes the retrieval set, from which 10,500 random instances are taken as the training set. Table 2 summarizes the number of samples in each set.
4.2 Experimental Settings And Evaluation Measures
We implement our method based on the open-source Caffe framework. In training, the networks are updated alternately with the ADAM stochastic gradient solver. We alternate between 4 steps of optimizing the discriminator and 1 step of optimizing the generator. We initialize VGGNet with parameters pre-trained on the ImageNet dataset, except for the last layer; for the text modality, all parameters are randomly initialized. The batch size is 64 and the total number of epochs is 100. The base learning rate is 0.005, and it is reduced to one tenth of its current value after every 20 epochs. At test time, we use only the attention-aware features (i.e., the foregrounds) of the data to construct the binary codes.
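The learning-rate schedule described above (base rate 0.005, divided by 10 every 20 epochs) amounts to a simple step decay:

```python
def learning_rate(epoch, base_lr=0.005, drop_every=20, factor=0.1):
    """Step decay: multiply the learning rate by `factor` every `drop_every` epochs."""
    return base_lr * (factor ** (epoch // drop_every))

assert learning_rate(0) == 0.005                      # epochs 0-19
assert abs(learning_rate(20) - 0.0005) < 1e-12        # epochs 20-39
assert abs(learning_rate(45) - 0.00005) < 1e-12       # epochs 40-59
```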
All samples are ranked according to their Hamming distance from the query. To evaluate the hashing models, we use two metrics: mean average precision (MAP) and precision-recall curves. MAP is a standard evaluation metric for information retrieval, defined as the mean of the average precision over a set of queries.
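For reference, MAP over ranked retrieval lists can be computed as follows (a standard sketch; the toy relevance lists are made up):

```python
import numpy as np

def average_precision(relevance):
    """AP for one query, given 0/1 relevance of the ranked results."""
    relevance = np.asarray(relevance, dtype=float)
    hits = np.cumsum(relevance)                 # relevant items seen so far
    ranks = np.arange(1, len(relevance) + 1)
    if hits[-1] == 0:
        return 0.0
    # Precision at each relevant position, averaged over relevant items.
    return float(np.sum(relevance * hits / ranks) / hits[-1])

def mean_average_precision(relevance_lists):
    """MAP: mean of the per-query average precisions."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))

# Two toy queries whose results are already sorted by Hamming distance.
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 0]])
```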
4.3 Comparison with State-of-the-art Methods
The comparison of search accuracies on all three datasets is shown in Table 1. From the table, we can see that our method outperforms the other baselines and achieves excellent performance. For example, the MAP of our method is 0.5458, compared to 0.5185 for the second-best algorithm, DCMH. The precision-recall curves are shown in Figure 7; our method again performs favorably against the existing baselines.
We also explore the effect of a smaller network architecture in the feature learning module for the image modality, since VGGNet is a large deep network. In this experiment, we select CNN-F as the basic network for the image modality. The comparison results are shown in Table 3. VGGNet performs better than CNN-F, while our method with CNN-F still achieves good performance compared to the other state-of-the-art baselines.
The main reason for the good performance of our method is that it obtains an attention distribution for the multi-modal data. Figure 8 shows some examples for the image modality. Note that it is hard to visualize the text modality (the text networks use fully-connected layers instead of a CNN, and the word order is lost in the BOW representation), so we do not show the masks learned by the text network.
5 Conclusion

In this paper, we proposed a novel approach called HashGAN for cross-modal hashing based on an adversarial architecture. HashGAN contains three major components: a feature learning module, a generative attention module and a discriminative hashing module. The feature learning module learns powerful representations of the multi-modal data. The generator and discriminator play a two-player minimax game, in which the discriminator tries to minimize the similarity-preserving loss functions while the generator aims to maximize the retrieval loss of the inattention-aware features. We evaluated our method on three datasets, and the experimental results demonstrate its appealing performance.
-  K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang, “A comprehensive survey on cross-modal retrieval,” arXiv preprint arXiv:1607.06215, 2016.
-  Y. Zhen and D.-Y. Yeung, “Co-regularized hashing for multimodal data,” in NIPS, pp. 1376–1384, 2012.
-  D. Zhang and W.-J. Li, “Large-scale supervised multimodal hashing with semantic correlation maximization.,” in AAAI, vol. 1, p. 7, 2014.
-  G. Ding, Y. Guo, and J. Zhou, “Collective matrix factorization hashing for multimodal data,” in CVPR, pp. 2075–2082, 2014.
-  Z. Lin, G. Ding, M. Hu, and J. Wang, “Semantics-preserving hashing for cross-view retrieval,” in CVPR, pp. 3864–3872, 2015.
-  Q. Y. Jiang and W. Li, “Deep cross-modal hashing,” in CVPR, 2016.
-  Y. Cao, M. Long, J. Wang, Q. Yang, and S. Y. Philip, “Deep visual-semantic hashing for cross-modal retrieval.,” in KDD, pp. 1445–1454, 2016.
-  E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao, “Pairwise relationship guided deep hashing for cross-modal retrieval.,” in AAAI, pp. 1618–1625, 2017.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, pp. 2048–2057, 2015.
-  D. Wang, P. Cui, M. Ou, and W. Zhu, “Learning compact hash codes for multimodal representations using orthogonal deep structure,” TMM, vol. 17, no. 9, pp. 1404–1416, 2015.
-  J. Masci, M. M. Bronstein, A. M. Bronstein, and J. Schmidhuber, “Multimodal similarity-preserving hashing,” TPAMI, vol. 36, no. 4, pp. 824–830, 2014.
-  Z. Yu, F. Wu, Y. Yang, Q. Tian, J. Luo, and Y. Zhuang, “Discriminative coupled dictionary hashing for fast cross-media retrieval,” in SIGIR, pp. 395–404, 2014.
-  Y. Cao, M. Long, J. Wang, and H. Zhu, “Correlation autoencoder hashing for supervised cross-modal search,” in ICMR, pp. 197–204, 2016.
-  L. Sun, S. Ji, and J. Ye, “A least squares formulation for canonical correlation analysis,” in ICML, pp. 1024–1031, 2008.
-  R. He, W.-S. Zheng, and B.-G. Hu, “Maximum correntropy criterion for robust face recognition,” TPAMI, vol. 33, no. 8, pp. 1561–1576, 2011.
-  J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” in ICLR, 2015.
-  Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in CVPR, pp. 21–29, 2016.
-  S. Sharma, R. Kiros, and R. Salakhutdinov, “Action recognition using visual attention,” arXiv preprint arXiv:1511.04119, 2015.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, pp. 2672–2680, 2014.
-  A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
-  M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
-  M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
-  J. Wang, L. Yu, W. Zhang, Y. Gong, Y. Xu, B. Wang, P. Zhang, and D. Zhang, “Irgan: A minimax game for unifying generative and discriminative information retrieval models,” arXiv preprint arXiv:1705.10513, 2017.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  M. Courbariaux and Y. Bengio, “Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1,” CoRR, vol. abs/1602.02830, 2016.
-  H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in CVPR, pp. 3270–3278, 2015.
-  H. J. Escalante, C. A. Hernández, J. A. Gonzalez, A. López-López, M. Montes, E. F. Morales, L. E. Sucar, L. Villaseñor, and M. Grubinger, “The segmented and annotated iapr tc-12 benchmark,” CVIU, vol. 114, no. 4, pp. 419–428, 2010.
-  M. J. Huiskes and M. S. Lew, “The mir flickr retrieval evaluation,” in ICMIR, pp. 39–43, 2008.
-  T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “Nus-wide: a real-world web image database from national university of singapore,” in ICIVR, p. 48, 2009.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein, “Imagenet large scale visual recognition challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2014.
-  W. Liu, S. Kumar, S. Kumar, and S. F. Chang, “Discrete graph hashing,” in NIPS, pp. 3419–3427, 2014.
-  H. Hotelling, “Relations between two sets of variates,” Springer New York, 1992.
-  D. Wang, X. Gao, X. Wang, and L. He, “Semantic topic multimodal hashing for cross-media retrieval,” in ICAI, pp. 3890–3896, 2015.
-  K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” Computer Science, 2014.