1 Introduction
With the rapid development of the Internet, different types of media data are growing quickly, e.g., texts, images and videos. These different types of data may describe the same events or topics; for example, Flickr allows users to post interactive comments on photos. Hence, a retrieval model for multimodal data is highly desirable. Cross-modal retrieval, which takes one type of data as the query and returns relevant data of another type, is receiving increasing attention since it is a natural way to search multimodal data. Existing methods can be roughly divided into two categories [1]: real-valued representation learning and binary representation learning. Owing to the low storage cost and fast retrieval speed of binary representations, we focus only on cross-modal binary representation learning (i.e., hashing) in this paper.
To date, various cross-modal hashing algorithms [2, 3, 4, 5, 6, 7, 8] have been proposed for embedding correlations among different modalities of data. In the cross-modal hashing procedure, feature extraction is the first step for representing all modalities of data; the extracted multimodal features are then projected into a common Hamming space for subsequent search. Many methods [4, 3] use shallow architectures for feature extraction. For example, collective matrix factorization hashing (CMFH) [4] and semantic correlation maximization (SCM) [3] use hand-crafted features. Recently, deep learning has also been adopted for cross-modal hashing due to its powerful ability to learn good representations of data. Representative works of deep-network-based cross-modal hashing include deep cross-modal hashing (DCMH) [6], deep visual-semantic hashing (DVSH) [7], pairwise relationship guided deep hashing (PRDH) [8] and so on.
In parallel, the computational model of “attention” has drawn much interest due to its impressive results in various applications, e.g., image captioning [9]. It is also desirable for the cross-modal retrieval problem. For example, as shown in Figure 1, given the query “girl sits on donkey”, if we can locate the more informative objects in the image (e.g., the black regions), higher accuracy can be obtained. To the best of our knowledge, the attention mechanism has not been well explored for cross-modal hashing.
In this paper, we propose an adversarial hashing network with an attention mechanism for cross-modal hashing. Ideally, good attention masks should locate discriminative regions, which also means the unattended regions of the data are uninformative and make it hard to preserve similarities. Hence, in our proposed network, adaptive attention masks are generated for the multimodal data, and the learned masks divide the data into attended samples (which keep only the foregrounds of the data) and unattended samples (which keep only the backgrounds). Hinging on such attention masks, a good discriminative hashing model should preserve the similarities of both the foreground samples (which can be viewed as easy examples) and the background samples (hard examples), enhancing the robustness and performance of the learned hash functions. In turn, a good generator should produce attention masks under which the discriminator cannot preserve the similarities of the background samples, since unattended regions should not be discriminative.
Based on this, we present a new adversarial model called HashGAN, which is illustrated in Figure 2 and consists of three major components: (1) a feature learning module, which uses a CNN or an MLP to extract high-level semantic representations of the multimodal data; (2) a generative attention module, which generates the adaptive attention masks and divides the feature representations into attended and unattended feature representations; and (3) a discriminative hashing module, which focuses on learning the binary codes for the multimodal data. HashGAN trains two adversarial networks alternately: the discriminator learns to preserve the similarities of both the easy foreground feature representations and the hard background feature representations, while the generator learns to produce masks that make the discriminator fail to preserve the similarities of the background feature representations. An adversarial retrieval loss and a cross-modal retrieval loss are proposed to obtain good attention masks and powerful hash functions.
The main contributions of our work are threefold. First, we propose an attention-aware method for the cross-modal hashing problem, which is able to detect the informative regions of multimodal data. Second, we propose HashGAN for learning effective attention masks and compact binary codes simultaneously. Third, we quantitatively evaluate the usefulness of attention in cross-modal hashing, and our method yields better performance compared with several state-of-the-art methods.
2 Related Work
2.1 Cross-modal Hashing
According to the information utilized for learning the common representations, cross-modal hashing methods can be categorized into three groups [1]: 1) unsupervised methods [10], 2) pairwise-based methods [11, 2] and 3) supervised methods [12, 13]. Unsupervised methods only use co-occurrence information to learn hash functions for multimodal data. For instance, cross-view hashing (CVH) [14] extends spectral hashing from the unimodal to the multimodal scenario. Pairwise-based methods use both co-occurrence information and similar/dissimilar pairs to learn the hash functions. Bronstein et al. [15] proposed cross-modal similarity sensitive hashing (CMSSH), which learns hash functions ensuring that if two samples (of different modalities) are relevant/irrelevant, their corresponding binary codes are similar/dissimilar. Supervised methods exploit label information to learn more discriminative common representations. Semantic correlation maximization (SCM) [3] uses label vectors to obtain the similarity matrix and reconstructs it through the binary codes.
However, most of these works are based on hand-crafted features. Recently, deep learning methods have shown that they can effectively discover correlations across different modalities. The most representative work is deep cross-modal hashing (DCMH) [6], which integrates feature learning and hash-code learning into the same framework. Cao et al. [7] proposed deep visual-semantic hashing (DVSH), which utilizes a convolutional neural network (CNN) and a long short-term memory (LSTM) network to separately learn common representations for each modality. Pairwise relationship guided deep hashing (PRDH) [8] also adopts deep CNN models to learn feature representations and hash codes simultaneously. However, all these methods encode an entire data point into a binary representation. Few works have attempted to introduce the attention mechanism into cross-modal hashing.
2.2 Attention Models
Attention-aware methods capture where a model should focus when performing a particular task. The attention mechanism has proved to be very powerful in many applications, such as image classification [16], image captioning [9], image question answering [17] and video action recognition [18]. For example, Xu et al. [9] proposed two forms of attention for image captioning: a “hard” attention mechanism trained by REINFORCE and a “soft” attention mechanism trained by standard back-propagation. Stacked attention networks (SANs) [17] take multiple steps to progressively focus attention on the relevant regions, leading to better answers for image QA. Sharma et al. [18] proposed a soft-attention-based model for action recognition that uses recurrent neural networks (RNNs) with long short-term memory (LSTM) units to capture both spatial and temporal information.
2.3 Generative Adversarial Network
Generative adversarial networks (GANs) have received a lot of interest in generative modelling. The original GAN [19] trains two models: a discriminative model and a generative model. The discriminative model learns to determine whether a sample comes from the model distribution or the data distribution, while the generative model attempts to produce samples that fool the discriminative model.
Recently, several approaches have been proposed to improve the original GAN, e.g., DCGAN [20], CGAN [21] and Wasserstein GAN [22]. IRGAN [23] is a recently proposed method for information retrieval, in which the generative retrieval model focuses on predicting relevant documents and the discriminative retrieval model focuses on predicting relevancy given a query-document pair. Different from our method, IRGAN is designed for unimodal retrieval and is not an attention-aware method.
In this paper, we extend GANs to cross-modal hashing. We carefully design a new GAN, called HashGAN, to generate attention-aware common representations and to learn similarity-preserving hash functions.
3 HashGAN
3.1 Problem Definition
Suppose there are $n$ training samples, each of which is represented in several modalities, e.g., audio, video, image, text, etc. In this paper, we focus on two modalities, text and image; note that our method can be easily extended to other modalities.
We denote the training data as $\{(x_i, t_i)\}_{i=1}^{n}$, where $x_i$ is the $i$-th image and $t_i$ is the corresponding text description of image $x_i$. We also have a cross-modal similarity matrix $S$, where $S_{ij} = 1$ means the $i$-th image and the $j$-th text are similar and $S_{ij} = 0$ means they are dissimilar.
The goal of cross-modal hashing is to learn two mapping functions that transform images and texts, respectively, into a common binary code space in which the similarities between paired images and texts are preserved. Formally, let $b_i^x$ and $b_j^t$ denote the generated $k$-bit binary codes for the $i$-th image and the $j$-th text, respectively. If $S_{ij} = 1$, the $i$-th image and the $j$-th text are similar, so the Hamming distance between $b_i^x$ and $b_j^t$ should be small; when $S_{ij} = 0$, the Hamming distance between them should be large.
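To make this retrieval criterion concrete: for $k$-bit codes in $\{-1, +1\}^k$, the Hamming distance relates to the inner product by $d_H(b_1, b_2) = (k - b_1^\top b_2)/2$. The short sketch below is our own illustration (the example codes are made up), not code from the paper:

```python
import numpy as np

def hamming_distance(b1, b2):
    """Hamming distance between two {-1, +1} binary codes.

    For k-bit codes, d_H = (k - b1.b2) / 2, so similar codes
    (large inner product) have a small distance.
    """
    b1, b2 = np.asarray(b1), np.asarray(b2)
    return int((len(b1) - b1.dot(b2)) // 2)

# A similar pair (S_ij = 1) should end up with a small distance,
# a dissimilar pair (S_ij = 0) with a large one.
b_img  = np.array([ 1, -1,  1,  1])   # hypothetical 4-bit image code
b_txt1 = np.array([ 1, -1,  1, -1])   # similar text: differs in 1 bit
b_txt2 = np.array([-1,  1, -1, -1])   # dissimilar text: differs in 4 bits

print(hamming_distance(b_img, b_txt1))  # 1
print(hamming_distance(b_img, b_txt2))  # 4
```

The inner-product form is why relaxed codes (Section 3.2.3) can be trained with ordinary dot products and binarized only at test time.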
3.2 Network Architecture
We propose HashGAN for the cross-modal hashing problem. It contains three types of networks: 1) feature learning networks for obtaining high-level representations of the multimodal data, 2) a generative attention network for learning the attention distributions, and 3) discriminative hashing networks for learning the binary codes for cross-modal retrieval.
3.2.1 Feature Learning Components
For the image modality, a convolutional neural network is used to obtain high-level representations of images. In this paper, we use VGGNet [24] as the basic network to generate the feature maps, as shown in Figure 3. Let $F_i$ denote the feature maps extracted from the $i$-th raw image.
For the text modality, we use a multi-layer perceptron (MLP) to obtain a powerful semantic representation of texts. Following DCMH [6], we use bag-of-words (BOW) vectors as the feature representation for the text modality. There are two fully-connected layers, as shown in Figure 3. We denote $y_j$ as the feature vector of the $j$-th text.
3.2.2 Generative Attention Components
Given the image feature maps $F_i$ and the text feature vector $y_j$, we first feed them into a one-layer neural network, i.e., a convolutional layer for the image feature maps and a fully-connected layer for the text feature vector, followed by a softmax function and a threshold function to generate the attention distribution over the regions of the multimodal data.
More specifically, Figure 4 shows the pipeline for the image modality in detail. Suppose $F_i \in \mathbb{R}^{H \times W \times C}$ are the feature maps of the $i$-th image, where $H$, $W$ and $C$ are the height, width and number of channels of the feature maps, respectively. In the first step, the feature maps are mapped into a mask $M \in \mathbb{R}^{H \times W}$ by a convolutional layer. Next, the mask goes through a softmax layer whose output, denoted as $P$, is defined as

(1) $P_{uv} = \frac{\exp(M_{uv})}{\sum_{u'=1}^{H} \sum_{v'=1}^{W} \exp(M_{u'v'})},$

where $M_{uv}$ and $P_{uv}$ denote the values in the $u$-th row and $v$-th column of the matrices $M$ and $P$, respectively. The elements of $P$ form a probability distribution over spatial locations, with $0 \le P_{uv} \le 1$ and $\sum_{u,v} P_{uv} = 1$. Larger values in $P$ correspond to foregrounds, while backgrounds tend to have smaller responses. Thus, in the third step, we add a threshold layer to divide the data into attended and unattended regions, defined as

(2) $\hat{M}_{uv} = \begin{cases} 1, & P_{uv} \ge \tau, \\ 0, & P_{uv} < \tau, \end{cases}$

where $\tau$ is a predefined threshold, set empirically in our experiments. The output of the threshold layer is a binary mask whose elements are either 0 or 1. The regions with value 1 are regarded as foregrounds, i.e., the regions attended to, while the other regions are regarded as backgrounds.
Based on the attention distribution, we can compute the attention-aware feature maps $F_i^{+}$ and inattention-aware feature maps $F_i^{-}$ for the $i$-th image by multiplying by the binary mask element-wise, formulated as

(3) $F_{i,uvc}^{+} = \hat{M}_{uv} \, F_{i,uvc}, \qquad F_{i,uvc}^{-} = (1 - \hat{M}_{uv}) \, F_{i,uvc},$

for all $1 \le u \le H$, $1 \le v \le W$ and $1 \le c \le C$. The foreground is $F_i^{+}$ and the background is $F_i^{-}$. For ease of presentation, we refer to this whole procedure as the generative attention module for the image modality.
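This mask-and-split pipeline can be sketched in a few lines of NumPy. The sketch below is our own illustration: the learned convolution is replaced by a fixed channel average, and the threshold value $\tau = 1/(HW)$ (the uniform probability) is a placeholder, not the paper's setting:

```python
import numpy as np

def attention_split(F, tau):
    """Softmax attention mask + threshold + foreground/background split.

    F: (H, W, C) feature maps; tau: threshold on the softmax mask.
    The learned conv layer is modelled as a fixed per-location channel mix.
    """
    H, W, C = F.shape
    w = np.ones(C) / C                     # stand-in for learned conv weights
    M = F @ w                              # (H, W) mask logits
    P = np.exp(M - M.max())
    P /= P.sum()                           # softmax over all H*W locations
    M_bin = (P >= tau).astype(F.dtype)     # binary mask after thresholding
    F_fg = F * M_bin[..., None]            # attended (foreground) features
    F_bg = F * (1 - M_bin)[..., None]      # unattended (background) features
    return M_bin, F_fg, F_bg

F = np.random.rand(4, 4, 8)
M_bin, F_fg, F_bg = attention_split(F, tau=1.0 / 16)
# every location goes to exactly one side, so the split is lossless:
assert np.allclose(F_fg + F_bg, F)
```

Since the softmax values sum to 1 over the $H \times W$ grid, thresholding at the uniform value guarantees at least one foreground location.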
For the text modality, we imitate the image pipeline, as shown in Figure 5. Since texts are represented by feature vectors rather than feature maps, we use a fully-connected layer instead of the convolutional layer, again followed by the softmax and threshold operations. Formally, we compute

(4) $\hat{m}_j = \mathrm{threshold}\big(\mathrm{softmax}(W y_j + v)\big), \qquad y_j^{+} = (\hat{m}_j \otimes \mathbf{1}) \odot y_j, \quad y_j^{-} = \big((1 - \hat{m}_j) \otimes \mathbf{1}\big) \odot y_j,$

where $W$ and $v$ are the two parameters of the fully-connected layer, $\otimes$ is the Kronecker product (replicating the mask to the feature dimension) and $\odot$ is element-wise multiplication. We denote $y_j^{+}$ and $y_j^{-}$ as the attention-aware and inattention-aware features of the $j$-th text, respectively.
However, taking the derivative of the threshold function directly is incompatible with back-propagation during training. Specifically, suppose $\mathcal{L}$ is the loss function; we need $\partial \mathcal{L} / \partial P$ to update the network parameters by stochastic gradient descent (SGD) during back-propagation. However, the derivative $\partial \hat{M} / \partial P$ of the threshold layer is almost zero everywhere, by the definition of the threshold function. Then, by the chain rule $\partial \mathcal{L} / \partial P = (\partial \mathcal{L} / \partial \hat{M}) \cdot (\partial \hat{M} / \partial P)$, we see that $\partial \mathcal{L} / \partial P$ is also nearly zero everywhere. Eventually, such an almost zero-valued node may block the back-propagation process.
3.2.3 Discriminative Hashing Components
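The blocking effect can be checked numerically: a finite-difference estimate of the step function's derivative is zero at almost every point, so any gradient flowing through it is annihilated by the chain rule. The values below are purely illustrative:

```python
def threshold(P, tau):
    """Hard threshold: 1 if P >= tau else 0."""
    return float(P >= tau)

# Finite-difference estimate of d threshold / dP at a point away from tau:
P, tau, eps = 0.3, 0.5, 1e-4
grad = (threshold(P + eps, tau) - threshold(P - eps, tau)) / (2 * eps)
print(grad)  # 0.0 -- the step function is flat almost everywhere

# by the chain rule, any upstream gradient times this local gradient vanishes:
upstream = 7.3   # an arbitrary dL/dM_hat value
print(upstream * grad)  # 0.0
```

Only at the jump point $P = \tau$ (a measure-zero set) is the finite difference nonzero, which is exactly why the backward pass gets blocked.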
The discriminator networks encode the high-level features of the two modalities into binary codes.
Since we adopt VGGNet as our basic architecture, we simply reuse its last fully-connected layers, e.g., fc6 and fc7 (the final fully-connected layer, fc8, is removed since it is designed for the classification problem), to encode the images into binary codes. We then add a fully-connected layer with $k$-dimensional output, followed by a tanh layer that restricts the values to the range $(-1, 1)$. The outputs of the image discriminator network are the binary codes of the $i$-th attention-aware and inattention-aware feature maps, respectively.
For the text modality, we likewise add a fully-connected layer and a tanh layer to encode the text features into $k$ bits. As with the image discriminator, the outputs are the binary codes of the attention-aware and inattention-aware text features, respectively.
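A minimal sketch of this relaxation scheme (the layer shapes and weights below are hypothetical, not the paper's actual sizes): a fully-connected layer followed by tanh yields relaxed codes in $(-1, 1)$ during training, which are binarized with the sign function for retrieval:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 16, 128                           # code length, feature dimension (made up)
W = rng.standard_normal((k, d)) * 0.1    # hypothetical FC-layer weights
feat = rng.standard_normal(d)            # an attention-aware feature vector

relaxed = np.tanh(W @ feat)              # training-time output, each entry in (-1, 1)
code = np.sign(relaxed)                  # test-time binarization to {-1, +1}

assert np.all(np.abs(relaxed) < 1)       # tanh keeps values strictly inside (-1, 1)
```

Training on the tanh relaxation keeps the loss differentiable; the sign step is applied only when producing the final binary codes.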
3.3 Hashing Objectives
Our objective contains two types of terms: 1) the cross-modal retrieval loss, which preserves the similarities between different modalities of data, and 2) the adversarial retrieval loss, which drives the generation of the attention distribution.
3.3.1 Cross-modal Retrieval Loss
The aim of the cross-modal loss function is to preserve the similarities between images and texts. Following [8], both inter-modal and intra-modal ranking losses are used: hash codes from different modalities should preserve semantic similarity, and hash codes from the same modality should preserve it as well.
The cross-modal retrieval loss can be formulated as

(5) $\mathcal{L}_r = J_{t,x} + J_{x,t} + J_{x,x} + J_{t,t},$

where $J_{p,q}$ denotes the similarity-preserving loss when modality $p$ is taken as the query to retrieve relevant data of modality $q$, with $p, q \in \{x, t\}$ ($x$ for image and $t$ for text). For example, $J_{t,x}$ means text queries are used to retrieve relevant images. The first two terms preserve semantic similarity across modalities, and the last two preserve similarity within each modality.
We take $J_{t,x}$ as an example. Given the binary code $b_j^t$ of the $j$-th text, good hash functions should rank similar images ahead of dissimilar ones; that is, in Hamming space we should have $\|b_j^t - b_i^x\|_H < \|b_j^t - b_{i'}^x\|_H$ whenever $S_{ji} = 1$ and $S_{ji'} = 0$. Formally, $J_{t,x}$ can be defined as

(6) $J_{t,x} = \sum_{(j,i,i')} \max\big(0,\, \varepsilon + \|b_j^t - b_i^x\|_H - \|b_j^t - b_{i'}^x\|_H\big),$

where $(j, i, i')$ is a triplet with $S_{ji} = 1$ and $S_{ji'} = 0$, and $\varepsilon$ is a margin parameter. This objective is the triplet ranking loss [26], which has shown its effectiveness in unimodal retrieval.
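For concreteness, here is a small sketch of a triplet ranking loss of this form, using the squared Euclidean distance on relaxed codes as a differentiable surrogate for the Hamming distance (the margin and the example codes are illustrative, not the paper's values):

```python
import numpy as np

def triplet_ranking_loss(b_q, b_pos, b_neg, margin=1.0):
    """Query should be closer to the similar item than to the
    dissimilar one by at least `margin` (hinge on the gap)."""
    d_pos = np.sum((b_q - b_pos) ** 2)
    d_neg = np.sum((b_q - b_neg) ** 2)
    return max(0.0, margin + d_pos - d_neg)

b_text       = np.array([ 1.0, -1.0,  1.0, -1.0])   # query code
b_img_sim    = np.array([ 1.0, -1.0,  1.0,  1.0])   # similar image (S_ji = 1)
b_img_dissim = np.array([-1.0,  1.0, -1.0,  1.0])   # dissimilar image (S_ji' = 0)

print(triplet_ranking_loss(b_text, b_img_sim, b_img_dissim))  # 0.0 (well ranked)
```

Swapping the positive and negative items makes the hinge active, so the loss pushes the wrongly ranked pair apart.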
Similarly, $J_{x,t}$ can be defined as

(7) $J_{x,t} = \sum_{(i,j,j')} \max\big(0,\, \varepsilon + \|b_i^x - b_j^t\|_H - \|b_i^x - b_{j'}^t\|_H\big).$

The intra-modal loss $J_{x,x}$ can be formulated as

(8) $J_{x,x} = \sum_{(i,i_1,i_2)} \max\big(0,\, \varepsilon + \|b_i^x - b_{i_1}^x\|_H - \|b_i^x - b_{i_2}^x\|_H\big),$

and $J_{t,t}$ is

(9) $J_{t,t} = \sum_{(j,j_1,j_2)} \max\big(0,\, \varepsilon + \|b_j^t - b_{j_1}^t\|_H - \|b_j^t - b_{j_2}^t\|_H\big),$

where each sum ranges over triplets whose first pair is similar and whose second pair is dissimilar.
3.3.2 Adversarial Retrieval Loss
Inspired by the impressive results of generative adversarial networks (GANs) in image generation, we adopt the adversarial idea to generate the attention distribution. As in a GAN, our method has two models: the generative attention model $G$ and the discriminative hashing model $D$. The model $D$ aims to preserve the semantic similarity between different modalities, while $G$ tries to generate the attention distribution as described in Subsection 3.2.2. The inattention-aware features produced by $G$ should make $D$ fail to preserve the semantic similarities. Hence, the adversarial retrieval loss can be expressed as

(10) $\mathcal{L}_{adv} = J^{-}_{t,x} + J^{-}_{x,t} + J^{-}_{x,x} + J^{-}_{t,t},$

where each $J^{-}$ term is the corresponding triplet ranking loss computed on the codes of the generated inattention-aware features $F_i^{-}$ and $y_j^{-}$. The generator $G$ tries to maximize this loss, while the discriminator $D$ tries to minimize it:

(11) $\min_{D} \max_{G} \; \mathcal{L}_{adv}.$
3.3.3 Full Objective

Our full objective combines the two losses:

$\mathcal{L} = \mathcal{L}_r + \mathcal{L}_{adv}.$

Similar to GAN training, we optimize the model alternately. First, the parameters of the generative attention model are fixed and all other parameters are trained:

(12) $\min_{D} \; \mathcal{L}_r + \mathcal{L}_{adv}.$

Then the remaining parameters are fixed and the generative attention model is updated:

(13) $\max_{G} \; \mathcal{L}_{adv}.$
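The alternating optimization can be sketched as a control-flow skeleton. The 4:1 ratio of discriminator to generator updates is an assumption on our part (Section 4.2 alternates four steps of one model with one step of the other), and the recorder object is a stub that only tracks the update order:

```python
class Recorder:
    """Stub that records which model is updated at each step."""
    def __init__(self):
        self.calls = []
    def step(self, who):
        # a real implementation would compute the loss and apply gradients here
        self.calls.append(who)

def train_epoch(batches, rec, d_steps=4):
    """Alternating schedule: d_steps discriminator updates, then one
    generator update, per batch (assumed 4:1 ratio)."""
    for _ in range(batches):
        for _ in range(d_steps):     # generator fixed, update discriminator
            rec.step("D")
        rec.step("G")                # discriminator fixed, update generator

rec = Recorder()
train_epoch(batches=2, rec=rec)
print(rec.calls)  # ['D', 'D', 'D', 'D', 'G', 'D', 'D', 'D', 'D', 'G']
```

Only the scheduling is shown; the actual losses and optimizers would plug into the two `step` sites.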
4 Experiments
Table 1: MAP results of our method and the baselines on the three datasets.

Task: Text Query → Image Database

Method | IAPR TC-12 (16/32/64 bits) | MIRFlickr-25K (16/32/64 bits) | NUS-WIDE (16/32/64 bits)
CCA | 0.3493 / 0.3438 / 0.3378 | 0.5742 / 0.5713 / 0.5691 | 0.3731 / 0.3661 / 0.3613
CMFH | 0.4168 / 0.4212 / 0.4277 | 0.6365 / 0.6399 / 0.6429 | 0.5031 / 0.5187 / 0.5225
SCM | 0.3453 / 0.3410 / 0.3470 | 0.6939 / 0.7012 / 0.7060 | 0.5344 / 0.5412 / 0.5484
STMH | 0.3687 / 0.3897 / 0.4044 | 0.6074 / 0.6153 / 0.6217 | 0.4471 / 0.4677 / 0.4780
SePH | 0.4423 / 0.4562 / 0.4648 | 0.7216 / 0.7261 / 0.7319 | 0.5983 / 0.6025 / 0.6109
DCMH | 0.5185 / 0.5378 / 0.5468 | 0.7827 / 0.7900 / 0.7932 | 0.6389 / 0.6511 / 0.6571
Ours | 0.5358 / 0.5565 / 0.5648 | 0.7922 / 0.8062 / 0.8074 | 0.6708 / 0.6875 / 0.6939

Task: Image Query → Text Database

Method | IAPR TC-12 (16/32/64 bits) | MIRFlickr-25K (16/32/64 bits) | NUS-WIDE (16/32/64 bits)
CCA | 0.3422 / 0.3361 / 0.3300 | 0.5719 / 0.5693 / 0.5672 | 0.3742 / 0.3667 / 0.3617
CMFH | 0.4189 / 0.4234 / 0.4251 | 0.6377 / 0.6418 / 0.6451 | 0.4900 / 0.5053 / 0.5097
SCM | 0.3692 / 0.3666 / 0.3802 | 0.6851 / 0.6921 / 0.7003 | 0.5409 / 0.5485 / 0.5553
STMH | 0.3775 / 0.4002 / 0.4130 | 0.6132 / 0.6219 / 0.6274 | 0.4710 / 0.4864 / 0.4942
SePH | 0.4442 / 0.4563 / 0.4639 | 0.7123 / 0.7194 / 0.7232 | 0.6037 / 0.6136 / 0.6211
DCMH | 0.4526 / 0.4732 / 0.4844 | 0.7410 / 0.7465 / 0.7485 | 0.5903 / 0.6031 / 0.6093
Ours | 0.5293 / 0.5283 / 0.5439 | 0.7563 / 0.7719 / 0.7720 | 0.6300 / 0.6258 / 0.6468
In this section, we evaluate the performance of our proposed method on three datasets and compare it with several state-of-the-art algorithms.
4.1 Datasets
We choose three representative public datasets for evaluation: IAPR TC-12 [27], MIRFlickr-25K [28] and NUS-WIDE [29].
IAPR TC-12 is a popular dataset for cross-modal retrieval. It consists of 20,000 still natural images collected from a wide range of domains, with at least one sentence description for each image. The image-text pairs are multi-label, with 255 concept categories serving as the ground-truth labels. In our experiments, we use the whole dataset. For the image modality, we use the raw pixels directly; for each text sample, we convert the sentence descriptions into 2912-dimensional bag-of-words vectors.
MIRFlickr-25K includes 25,000 multi-label images downloaded from the photo-sharing website Flickr.com. The textual description of each image consists of several words, and each instance holds one or more labels among 24 concept categories. In our experiments, we first remove textual words that appear fewer than 20 times, and then delete the image-text pairs lacking textual words or labels, leaving 20,015 instances. For the image modality, we use raw pixels as before, while 1386-dimensional bag-of-words vectors represent the text modality.
NUS-WIDE is a widely used dataset for cross-modal retrieval consisting of 269,648 multi-label images. As in MIRFlickr, the textual representation of each image is a set of associated words. There are 81 concept categories provided for evaluation. In our experiments, we choose the image-text pairs that are associated with the 21 most frequent labels and 1,000 textual words, yielding 195,834 pairs. For the image modality, we again use raw pixels, and 1000-dimensional bag-of-words vectors are used for the text modality.
To establish the training and test sets, we randomly choose 2,000 image-text pairs from each of the IAPR TC-12 and MIRFlickr datasets as test (query) sets; the remaining instances form the retrieval sets, and 10,000 random samples from each retrieval set serve as the training sets. For the NUS-WIDE dataset, we select 2,100 image-text pairs as the test/query set; the rest constitutes the retrieval set, from which 10,500 random instances form the training set. Table 2 summarizes the number of samples in each split.
Table 2: Split statistics of the three datasets.

Split | IAPR TC-12 | MIRFlickr | NUS-WIDE
#Train | 10,000 | 10,000 | 10,500
#Test | 2,000 | 2,000 | 2,100
#Retrieval | 18,000 | 18,015 | 193,734
4.2 Experimental Settings and Evaluation Measures
We implement our method based on the open-source Caffe [30] framework. During training, the networks are updated alternately with the ADAM stochastic gradient solver; we alternate between four steps of optimizing the discriminative hashing model and one step of optimizing the generative attention model. We initialize VGGNet with weights pre-trained on the ImageNet dataset [31], except for the last layer. For the text modality, all parameters are randomly initialized. The batch size is 64 and the total number of epochs is 100. The base learning rate is 0.005, and it is reduced to one tenth of its current value every 20 epochs. At test time, we use only the attention-aware features, i.e., the foregrounds, of the data to construct the binary codes.
All samples are ranked according to their Hamming distance from the query. To evaluate the performance of the hashing models, we use two metrics: mean average precision (MAP) [32] and precision-recall curves. MAP is a standard evaluation metric for information retrieval; it is the mean of the average precision over a set of queries.
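MAP can be computed from the 0/1 relevance of the ranked retrieval lists. The following self-contained sketch is our own illustration (the example relevance lists are made up):

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance is a 0/1 list in ranked order."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    # precision at each rank position 1..n
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    # average the precision values at the positions of relevant items
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(all_rankings):
    return float(np.mean([average_precision(r) for r in all_rankings]))

# two queries: relevance of the retrieved items after ranking by Hamming distance
queries = [[1, 0, 1, 0], [0, 1, 1, 0]]
# AP1 = (1/1 + 2/3)/2 = 0.8333, AP2 = (1/2 + 2/3)/2 = 0.5833
print(round(mean_average_precision(queries), 4))  # 0.7083
```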
4.3 Comparison with State-of-the-art Methods
Six state-of-the-art cross-modal hashing approaches are selected as our baselines: CCA [33], CMFH [4], SCM [3], STMH [34], SePH [5] and DCMH [6].
The search accuracies on all three datasets are shown in Table 1. From the table, we can see that our method outperforms the other baselines and achieves excellent performance. For example, with 16-bit codes on IAPR TC-12 (text query), the MAP of our method is 0.5358, compared with 0.5185 for the second-best algorithm, DCMH. The precision-recall curves are shown in Figure 7; our method again performs favorably against the existing baselines.
We also explore the effect of a smaller network architecture in the feature learning module for the image modality, since VGGNet is a large deep network. In this experiment, we select CNN-F [35] as the basic network for the image modality. The comparison results are shown in Table 3. VGGNet performs better than CNN-F, but our method with CNN-F still achieves good performance compared with the other state-of-the-art baselines.
Table 3: MAP on IAPR TC-12 with different image networks.

Task | Network | 16 bits | 32 bits | 64 bits
Text Query → Image Database | VGGNet | 0.5358 | 0.5565 | 0.5648
Text Query → Image Database | CNN-F | 0.5267 | 0.5459 | 0.5538
Image Query → Text Database | VGGNet | 0.5293 | 0.5283 | 0.5439
Image Query → Text Database | CNN-F | 0.5211 | 0.5168 | 0.5208
The main reason for the good performance of our method is that it obtains an attention distribution for the multimodal data. Figure 8 shows some examples for the image modality. Note that it is hard to visualize the text modality (the text networks use fully-connected layers instead of a CNN, and the bag-of-words representation discards word order), so we do not show the masks learned by the text network.
5 Conclusion
In this paper, we proposed a novel approach called HashGAN for cross-modal hashing, based on the idea of adversarial architectures. The proposed HashGAN contains three major components: a feature learning module, a generative attention module and a discriminative hashing module. The feature learning module learns powerful representations of the multimodal data. The generator and discriminator play a two-player minimax game, in which the discriminator tries to minimize the similarity-preserving loss functions while the generator aims to maximize the retrieval loss of the inattention-aware features. We evaluated our method on three datasets, and the experimental results demonstrate its appealing performance.
References
 [1] K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang, “A comprehensive survey on cross-modal retrieval,” arXiv preprint arXiv:1607.06215, 2016.
 [2] Y. Zhen and D.-Y. Yeung, “Co-regularized hashing for multimodal data,” in NIPS, pp. 1376–1384, 2012.
 [3] D. Zhang and W.-J. Li, “Large-scale supervised multimodal hashing with semantic correlation maximization,” in AAAI, vol. 1, p. 7, 2014.
 [4] G. Ding, Y. Guo, and J. Zhou, “Collective matrix factorization hashing for multimodal data,” in CVPR, pp. 2075–2082, 2014.
 [5] Z. Lin, G. Ding, M. Hu, and J. Wang, “Semantics-preserving hashing for cross-view retrieval,” in CVPR, pp. 3864–3872, 2015.
 [6] Q.-Y. Jiang and W. Li, “Deep cross-modal hashing,” in CVPR, 2016.
 [7] Y. Cao, M. Long, J. Wang, Q. Yang, and S. Y. Philip, “Deep visual-semantic hashing for cross-modal retrieval,” in KDD, pp. 1445–1454, 2016.
 [8] E. Yang, C. Deng, W. Liu, X. Liu, D. Tao, and X. Gao, “Pairwise relationship guided deep hashing for cross-modal retrieval,” in AAAI, pp. 1618–1625, 2017.
 [9] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, pp. 2048–2057, 2015.
 [10] D. Wang, P. Cui, M. Ou, and W. Zhu, “Learning compact hash codes for multimodal representations using orthogonal deep structure,” TMM, vol. 17, no. 9, pp. 1404–1416, 2015.
 [11] J. Masci, M. M. Bronstein, A. M. Bronstein, and J. Schmidhuber, “Multimodal similarity-preserving hashing,” TPAMI, vol. 36, no. 4, pp. 824–830, 2014.
 [12] Z. Yu, F. Wu, Y. Yang, Q. Tian, J. Luo, and Y. Zhuang, “Discriminative coupled dictionary hashing for fast cross-media retrieval,” in SIGIR, pp. 395–404, 2014.

 [13] Y. Cao, M. Long, J. Wang, and H. Zhu, “Correlation autoencoder hashing for supervised cross-modal search,” in ICMR, pp. 197–204, 2016.
 [14] L. Sun, S. Ji, and J. Ye, “A least squares formulation for canonical correlation analysis,” in ICML, pp. 1024–1031, 2008.
 [15] R. He, W.-S. Zheng, and B.-G. Hu, “Maximum correntropy criterion for robust face recognition,” TPAMI, vol. 33, no. 8, pp. 1561–1576, 2011.
 [16] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” in ICLR, 2015.
 [17] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in CVPR, pp. 21–29, 2016.
 [18] S. Sharma, R. Kiros, and R. Salakhutdinov, “Action recognition using visual attention,” arXiv preprint arXiv:1511.04119, 2015.
 [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, pp. 2672–2680, 2014.
 [20] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
 [21] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
 [22] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” arXiv preprint arXiv:1701.07875, 2017.
 [23] J. Wang, L. Yu, W. Zhang, Y. Gong, Y. Xu, B. Wang, P. Zhang, and D. Zhang, “IRGAN: A minimax game for unifying generative and discriminative information retrieval models,” arXiv preprint arXiv:1705.10513, 2017.
 [24] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [25] M. Courbariaux and Y. Bengio, “BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1,” CoRR, vol. abs/1602.02830, 2016.
 [26] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in CVPR, pp. 3270–3278, 2015.
 [27] H. J. Escalante, C. A. Hernández, J. A. Gonzalez, A. López-López, M. Montes, E. F. Morales, L. E. Sucar, L. Villaseñor, and M. Grubinger, “The segmented and annotated IAPR TC-12 benchmark,” CVIU, vol. 114, no. 4, pp. 419–428, 2010.
 [28] M. J. Huiskes and M. S. Lew, “The MIR Flickr retrieval evaluation,” in ICMIR, pp. 39–43, 2008.
 [29] T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: A real-world web image database from National University of Singapore,” in ICIVR, p. 48, 2009.
 [30] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
 [31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein, “ImageNet large scale visual recognition challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2014.
 [32] W. Liu, S. Kumar, and S.-F. Chang, “Discrete graph hashing,” in NIPS, pp. 3419–3427, 2014.
 [33] H. Hotelling, “Relations between two sets of variates.” Springer New York, 1992.
 [34] D. Wang, X. Gao, X. Wang, and L. He, “Semantic topic multimodal hashing for cross-media retrieval,” in IJCAI, pp. 3890–3896, 2015.
 [35] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” in BMVC, 2014.