HashGAN: Attention-aware Deep Adversarial Hashing for Cross Modal Retrieval

11/26/2017 ∙ by Xi Zhang, et al.

With the rapid growth of multi-modal data, hashing methods for cross-modal retrieval have received considerable attention. Deep-network-based cross-modal hashing methods are appealing because they can integrate feature learning and hash coding into end-to-end trainable frameworks. However, it is still challenging to find content similarities between different modalities of data due to the heterogeneity gap. To further address this problem, we propose an adversarial hashing network with an attention mechanism that enhances the measurement of content similarities by selectively focusing on the informative parts of multi-modal data. The proposed adversarial network, HashGAN, consists of three building blocks: 1) the feature learning module, which obtains feature representations; 2) the generative attention module, which generates an attention mask used to obtain the attended (foreground) and the unattended (background) feature representations; and 3) the discriminative hash coding module, which learns hash functions that preserve the similarities between different modalities. In our framework, the generative module and the discriminative module are trained in an adversarial way: the generator learns to make the discriminator unable to preserve the similarities of multi-modal data w.r.t. the background feature representations, while the discriminator aims to preserve the similarities of multi-modal data w.r.t. both the foreground and the background feature representations. Extensive evaluations on several benchmark datasets demonstrate that the proposed HashGAN brings substantial improvements over other state-of-the-art cross-modal hashing methods.




1 Introduction

Figure 1: Main idea of HashGAN. Take the text-to-image task as an example: the text query is similar to one image and irrelevant to the other. $G$ denotes the generator that produces attention masks, while $D$ denotes the desired similarity-preserving hash functions. The two images go through the generator $G$, which divides the data into attended/foreground samples and unattended/background samples. The process of the generative attention module is shown in (II). Finally, these four images and the query are fed into the discriminator $D$. We train the discriminator and the generator in an adversarial way (III): (1) the discriminator aims to learn hash functions that preserve the similarities for both the foreground samples and the background samples, and (2) the generator aims to generate attention masks that make the discriminator unable to preserve the similarities of the background samples.
Figure 2: Overview of HashGAN. The upper branch handles the image modality and the lower branch handles the text modality. Each branch is divided into three parts: feature learning ($F^I$ and $F^T$), generative attention ($G^I$ and $G^T$) and discriminative hashing ($D^I$ and $D^T$). The feature learning networks map the input multi-modal data into high-level feature representations. The generators then learn attention masks for these feature representations, and the attended (foreground) features and the unattended (background) features are obtained via the attention masks. Finally, the discriminators encode all features into binary codes and learn similarity-preserving hash functions. We train the discriminators and the generators alternately: the generators maximize the retrieval loss of the background features to produce good masks, while the discriminators minimize the error for both foreground and background features to obtain effective binary codes.

Due to the fast development of the Internet, different types of media data, e.g., texts, images and videos, are growing rapidly. These different types of data may describe the same events or topics. For example, Flickr allows users to attach comments to photos. Hence, a retrieval model for multi-modal data is highly desirable. Cross-modal retrieval, which takes one type of data as the query and returns the relevant data of another type, is receiving increasing attention since it is a natural way to search multi-modal data. Existing solutions can be roughly divided into two categories [1]: real-valued representation learning and binary representation learning. Because of the low storage cost and fast retrieval speed of binary representations, we focus only on cross-modal binary representation learning (i.e., hashing) in this paper.

To date, various cross-modal hashing algorithms [2, 3, 4, 5, 6, 7, 8] have been proposed for embedding correlations among different modalities of data. In the cross-modal hashing procedure, feature extraction is the first step for representing all modalities of data, and these multi-modal features are then projected into a common Hamming space for subsequent search. Many methods [4, 3] use shallow architectures for feature extraction. For example, collective matrix factorization hashing (CMFH) [4] and semantic correlation maximization (SCM) [3] use hand-crafted features. Recently, deep learning has also been adopted for cross-modal hashing due to its powerful ability to learn good representations of data. Representative deep-network-based cross-modal hashing works include deep cross-modal hashing (DCMH) [6], deep visual-semantic hashing (DVSH) [7] and pairwise relationship guided deep hashing (PRDH) [8].

In parallel, the computational model of “attention” has drawn much interest due to its impressive results in various applications, e.g., image captioning [9]. It is also desirable for the cross-modal retrieval problem. For example, as shown in Figure 1, given the query “girl sits on donkey”, higher accuracy can be obtained if we can locate the more informative objects in the image (e.g., the black regions). To the best of our knowledge, the attention mechanism has not been well explored for cross-modal hashing.

In this paper, we propose an adversarial hashing network with an attention mechanism for cross-modal hashing. Ideally, good attention masks should locate the discriminative regions, which also means that the unattended regions of the data are uninformative and make it hard to preserve similarities. Hence, in our proposed network, adaptive attention masks are generated for the multi-modal data, and the learned masks divide the data into attended samples (which keep only the foregrounds of the data) and unattended samples (which keep only the backgrounds). Given such attention masks, a good discriminative hashing network should preserve the similarities for both the foreground samples (which can be viewed as easy examples) and the background samples (hard examples), which enhances the robustness and performance of the learned hash functions. A good generator, in turn, should generate attention masks that make the discriminator unable to preserve the similarities of the background samples, since the unattended regions of the data should not be discriminative.

Based on this, we present a new adversarial model called HashGAN, which is illustrated in Figure 2 and consists of three major components: (1) a feature learning module, which uses a CNN or an MLP to extract high-level semantic representations for the multi-modal data, (2) a generative attention module, which generates adaptive attention masks and divides the feature representations into attended and unattended feature representations, and (3) a discriminative hashing module, which focuses on learning the binary codes for the multi-modal data. HashGAN trains two adversarial networks alternately: the discriminator learns to preserve the similarities for both the easy foreground feature representations and the hard background feature representations, while the generator learns to produce masks that make the discriminator fail to keep the similarities of the background feature representations. An adversarial retrieval loss and a cross-modal retrieval loss are proposed to obtain good attention masks and powerful hash functions.

The main contributions of our work are three-fold. First, we propose an attention-aware method for the cross-modal hashing problem, which is able to detect the informative regions of multi-modal data. Second, we propose HashGAN for learning effective attention masks and compact binary codes simultaneously. Third, we quantitatively evaluate the usefulness of attention in cross-modal hashing, and our method outperforms several state-of-the-art methods.

2 Related Work

2.1 Cross-modal Hashing

According to the information utilized for learning the common representations, cross-modal hashing methods can be categorized into three groups [1]: 1) unsupervised methods [10], 2) pairwise-based methods [11, 2] and 3) supervised methods [12, 13]. The unsupervised methods only use co-occurrence information to learn hash functions for multi-modal data. For instance, cross-view hashing (CVH) [14] extends spectral hashing from the uni-modal to the multi-modal scenario. The pairwise-based methods use both the co-occurrence information and similar/dissimilar pairs to learn the hash functions. Bronstein et al. [15] proposed cross-modal similarity sensitive hashing (CMSSH), which learns hash functions such that if two samples (from different modalities) are relevant/irrelevant, their corresponding binary codes are similar/dissimilar. The supervised methods exploit label information to learn more discriminative common representations. Semantic correlation maximization (SCM) [3] uses label vectors to obtain the similarity matrix and reconstructs it through the binary codes.

However, most of these works are based on hand-crafted features. Recently, deep learning methods have been shown to effectively discover the correlations across different modalities. The most representative work is deep cross-modal hashing (DCMH) [6], which integrates feature learning and hash-code learning into the same framework. Cao et al. [7] proposed deep visual-semantic hashing (DVSH), which utilizes both a convolutional neural network (CNN) and long short-term memory (LSTM) to separately learn the common representations for each modality. Pairwise relationship guided deep hashing (PRDH) [8] also adopts deep CNN models to learn feature representations and hash codes simultaneously.

However, all these methods encode an entire data point into a binary representation. Few works attempt to introduce an attention mechanism into cross-modal hashing.

2.2 Attention Models

Attention-aware methods capture where the model should focus when performing a particular task. The attention mechanism has proved to be very powerful in many applications, such as image classification [16], image captioning [9], image question answering [17] and video action recognition [18]. For example, Xu et al. [9] proposed two forms of attention for image captioning: a “hard” attention mechanism trained by REINFORCE and a “soft” attention mechanism trained by standard back-propagation. Stacked attention networks (SANs) [17] take multiple steps to progressively focus the attention on the relevant regions, leading to better answers for image QA. Sharma et al. [18] proposed a soft-attention-based model for action recognition, which uses recurrent neural networks (RNNs) with long short-term memory (LSTM) units to obtain both spatial and temporal information.

2.3 Generative Adversarial Network

Generative adversarial networks (GANs) have received a lot of interest in generative modelling problems. The original GAN [19] trains two models: the discriminative model $D$ and the generative model $G$. The discriminative model learns to determine whether a sample is from the model distribution or the data distribution. The generative model attempts to produce samples that can fool the discriminative model.

Recently, several approaches have been proposed to improve the original GAN, for example DCGAN [20], CGAN [21] and Wasserstein GAN [22]. IRGAN [23] is a recently proposed method for information retrieval, in which the generative retrieval model focuses on predicting relevant documents and the discriminative retrieval model focuses on predicting relevancy given a query-document pair. Different from our method, IRGAN is designed for uni-modal retrieval and is not attention-aware.

In this paper, we extend GAN to cross-modal hashing. We carefully design a new GAN, called HashGAN, to generate attention-aware common representations and to learn similarity-preserving hash functions.

3 HashGAN

3.1 Problem Definition

Suppose there are $n$ training samples, each of which is represented in several modalities, e.g., audio, video, image, text, etc. In this paper, we only focus on two modalities: text and image. Note that our method can be easily extended to other modalities.

We denote the training data as $\{(I_i, T_i)\}_{i=1}^{n}$, where $I_i$ is the $i$-th image and $T_i$ is the corresponding text description of image $I_i$. We also have a cross-modal similarity matrix $S$, where $S_{ij} = 1$ means the $i$-th image and the $j$-th text are similar and $S_{ij} = 0$ means they are dissimilar.

The goal of cross-modal hashing is to learn two mapping functions that transform images and texts into a common binary code space, in which the similarities between paired images and texts are preserved. Formally, let $b^I_i \in \{-1, 1\}^k$ and $b^T_j \in \{-1, 1\}^k$ denote the generated $k$-bit binary codes for the $i$-th image and the $j$-th text, respectively. If $S_{ij} = 1$, the $i$-th image and the $j$-th text are similar, so the Hamming distance between $b^I_i$ and $b^T_j$ should be small. When $S_{ij} = 0$, the Hamming distance between them should be large.
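As a small illustration of the retrieval criterion above, the following sketch computes pairwise Hamming distances between text codes and image codes. It assumes codes stored as NumPy arrays with entries in {-1, +1}; the function name `hamming_distance` and the toy codes are illustrative only and are not part of the original method.

```python
import numpy as np

def hamming_distance(query_codes, database_codes):
    """Pairwise Hamming distances between rows of two {-1, +1} code matrices."""
    k = query_codes.shape[1]
    # For +/-1 codes, the inner product equals k - 2 * (number of differing bits).
    return (k - query_codes @ database_codes.T) // 2

# Toy 4-bit codes (hypothetical values, for illustration only).
image_codes = np.array([[1, -1, 1, 1], [-1, -1, 1, -1]])
text_codes = np.array([[1, -1, 1, -1], [1, 1, -1, 1]])

dist = hamming_distance(text_codes, image_codes)  # rows: texts, columns: images
```

A similar text-image pair should end up with a small entry in `dist`, and a dissimilar pair with a large one.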

3.2 Network Architecture

We propose HashGAN for the cross-modal hashing problem. It contains three types of networks: 1) feature learning networks for obtaining high-level representations of the multi-modal data, 2) generative attention networks for learning the attention distributions, and 3) discriminative hashing networks for learning the binary codes for cross-modal retrieval.

3.2.1 Feature Learning Components: $F^I$ and $F^T$

For the image modality, a convolutional neural network is used to obtain the high-level representation of images. In this paper, we use VGGNet [24] as the basic network to generate the feature maps, as shown in Figure 3. Let $f^I_i$ denote the image feature maps of the $i$-th raw image.

Figure 3: Feature learning module for the image modality ($F^I$) and the text modality ($F^T$).

For the text modality, we use a multi-layer perceptron (MLP) to obtain the semantic representation of texts. Following DCMH [6], we also use bag-of-words (BOW) vectors as the input representation for the text modality. There are two fully-connected layers, as shown in Figure 3. We denote $f^T_i$ as the feature vector of the $i$-th text.

3.2.2 Generative Attention Components: $G^I$ and $G^T$

With the image feature maps $f^I_i$ and the text feature vector $f^T_i$, we first feed them into a one-layer neural network, i.e., a convolutional layer for the image feature maps and a fully-connected layer for the text feature vector, followed by a softmax function and a threshold function to generate the attention distribution over the regions of the multi-modal data.

More specifically, Figure 4 shows the pipeline for processing the image modality in detail. Suppose $f^I_i \in \mathbb{R}^{H \times W \times C}$ is the feature maps for the $i$-th image, where $H$, $W$ and $C$ are the height, width and number of channels of the feature maps, respectively. In the first step, the feature maps are mapped into the mask $Z \in \mathbb{R}^{H \times W}$ by a convolutional layer. Next, the mask goes through a softmax layer and the output is denoted as $M$, which is defined as

$$M(h, w) = \frac{\exp(Z(h, w))}{\sum_{h'=1}^{H} \sum_{w'=1}^{W} \exp(Z(h', w'))},$$

where $Z(h, w)$ and $M(h, w)$ denote the value in the $h$-th row and $w$-th column of the matrix $Z$ and the matrix $M$, respectively. The elements in $M$ form a probability distribution, where $M(h, w) \ge 0$ and $\sum_{h, w} M(h, w) = 1$.

A larger value in $M$ corresponds to the foreground, while the background tends to have a smaller response. Thus, in the third step, we add a threshold layer to divide the data into the attended regions and the unattended regions, which is defined as

$$B(h, w) = \begin{cases} 1, & M(h, w) \ge \alpha, \\ 0, & \text{otherwise}, \end{cases}$$

where $\alpha$ is a predefined threshold that is fixed in our experiments. The output of the threshold layer is a binary mask whose elements are either 0 or 1. The regions with value 1 are regarded as the foreground, i.e., the regions that are attended to, while the other regions are regarded as background.

Based on the attention distribution, we can calculate the attention-aware feature maps $\tilde{f}^I_i$ and the inattention-aware feature maps $\hat{f}^I_i$ for the $i$-th image by multiplying with the binary mask element-wise, which is formulated as

$$\tilde{f}^I_i(h, w, c) = B(h, w)\, f^I_i(h, w, c), \qquad \hat{f}^I_i(h, w, c) = \big(1 - B(h, w)\big)\, f^I_i(h, w, c),$$

for all $h$, $w$ and $c$. The foreground is $\tilde{f}^I_i$ and the background is $\hat{f}^I_i$. For ease of representation, we denote the whole procedure as $(\tilde{f}^I_i, \hat{f}^I_i) = G^I(f^I_i)$.

Figure 4: Binary mask generated by $G^I$ in the image branch.
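The image-branch pipeline described above (mask scores, softmax over spatial positions, thresholding, element-wise masking) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the convolution is modeled as a per-position dot product over channels (a single-channel 1x1 convolution without bias), and the threshold value `1 / (H * W)` is a placeholder chosen for the example. All names are hypothetical.

```python
import numpy as np

def attention_split(features, conv_weight, alpha):
    """Split feature maps of shape (H, W, C) into foreground and background parts.

    conv_weight has shape (C,) and stands in for a single-channel 1x1
    convolution (an assumption made for this sketch).
    """
    scores = features @ conv_weight                  # (H, W) mask scores
    exp = np.exp(scores - scores.max())              # numerically stable softmax
    probs = exp / exp.sum()                          # distribution over all positions
    mask = (probs >= alpha).astype(features.dtype)   # binary mask with values 0 or 1
    foreground = features * mask[..., None]          # attention-aware features
    background = features * (1.0 - mask[..., None])  # inattention-aware features
    return probs, mask, foreground, background

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))   # toy H=4, W=4, C=8 feature maps
w = rng.normal(size=8)
probs, mask, fg, bg = attention_split(feats, w, alpha=1.0 / 16)
```

Because the mask is binary, the foreground and background features partition the input: adding them back together recovers the original feature maps.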

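For the text modality, we follow a pipeline similar to that of the image modality, as shown in Figure 5. Since the inputs are feature vectors rather than feature maps, we use a fully-connected layer instead of the convolutional layer, and its output is fed to the softmax and threshold functions in turn. Formally, for the text feature vector $f^T_i$ we compute

$$M^T = \mathrm{softmax}(W f^T_i + b), \qquad B^T = \mathrm{threshold}(M^T), \qquad \tilde{f}^T_i = B^T \odot f^T_i, \qquad \hat{f}^T_i = (1 - B^T) \odot f^T_i,$$

where $W$ and $b$ are the two parameters of the fully-connected layer, and $\odot$ is the element-wise product. We denote $(\tilde{f}^T_i, \hat{f}^T_i) = G^T(f^T_i)$ as the attention-aware and inattention-aware features for the $i$-th text.

Figure 5: Binary mask generated by $G^T$ in the textual branch.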

However, taking the derivative of the threshold function directly is incompatible with back-propagation in training. Specifically, let $M$ be the softmax output and $B$ the binary mask produced by the threshold layer, and suppose that $L$ is the loss function. We need $\partial L / \partial M$ to update the network parameters by stochastic gradient descent (SGD) during back-propagation. However, the derivative $\partial B / \partial M$ in the threshold layer is zero almost everywhere according to the definition of $B$. By the chain rule,

$$\frac{\partial L}{\partial M} = \frac{\partial L}{\partial B} \cdot \frac{\partial B}{\partial M},$$

we can see that $\partial L / \partial M$ is also zero almost everywhere. Eventually, such a zero-valued node may block the back-propagation process.

To address this issue, we follow the idea proposed in [25], which uses the straight-through estimator to propagate the gradients through the threshold function. That is to say, we ignore the derivative $\partial B / \partial M$ and use $\partial L / \partial B$ as an estimator of $\partial L / \partial M$.
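To make the straight-through idea concrete, the sketch below writes the forward and backward passes of a threshold layer by hand in NumPy. It is an illustration under assumptions, not the authors' code: the toy squared-error loss, the threshold value 0.25 and all function names are hypothetical.

```python
import numpy as np

def threshold_forward(probs, alpha):
    """Forward pass: hard threshold producing a binary mask."""
    return (probs >= alpha).astype(probs.dtype)

def threshold_backward_ste(grad_wrt_mask):
    """Backward pass with the straight-through estimator.

    The true derivative of the threshold is zero almost everywhere, so the
    incoming gradient is passed through unchanged, as if the layer were the
    identity.
    """
    return grad_wrt_mask

probs = np.array([0.05, 0.40, 0.30, 0.25])
target = np.array([0.0, 1.0, 0.0, 1.0])

mask = threshold_forward(probs, alpha=0.25)
# Toy loss L = 0.5 * sum((mask - target)^2), so dL/dmask = mask - target.
grad_mask = mask - target
grad_probs = threshold_backward_ste(grad_mask)  # used in place of dL/dprobs
```

With the exact derivative, `grad_probs` would be all zeros and no signal would reach the layers that produce `probs`; with the estimator, the third position receives a nonzero gradient.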

3.2.3 Discriminative Hashing Components: $D^I$ and $D^T$

The discriminator networks encode the high-level features of the two modalities into binary codes.

Figure 6: Discriminative hashing networks for the image modality ($D^I$) and the text modality ($D^T$).

Since we adopt VGGNet as our basic architecture, we simply use its last fully-connected layers, e.g., fc6 and fc7 (the last fully-connected layer, e.g., fc8, is removed since it is designed for classification), to encode the images. We then add a fully-connected layer with $k$-dimensional output, where $k$ is the code length, followed by a tanh layer that restricts the values to the range $[-1, 1]$. Let $h^I_i$ and $\hat{h}^I_i$ denote the outputs of the image discriminator network, i.e., the binary codes for the $i$-th attention-aware and inattention-aware feature maps, respectively.

For the text modality, we also simply add a fully-connected layer and a tanh layer to encode the text features into $k$ bits. Similar to the image discriminator, $h^T_i$ and $\hat{h}^T_i$ denote the binary codes for the attention-aware and inattention-aware features, respectively.
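The hashing head described above (a fully-connected layer followed by tanh) can be sketched as below. The final binarization by sign is an assumption of this sketch, since the text only states that the tanh output is restricted to [-1, 1]; the layer sizes and names are likewise hypothetical.

```python
import numpy as np

def hash_layer(features, weight, bias):
    """Relaxed codes in [-1, 1]: fully-connected layer followed by tanh."""
    return np.tanh(features @ weight + bias)

def binarize(relaxed_codes):
    """Assumed binarization step: sign, with ties at 0 mapped to +1."""
    return np.where(relaxed_codes >= 0, 1, -1)

rng = np.random.default_rng(0)
text_features = rng.normal(size=(3, 10))  # 3 toy samples with 10-dim features
W = rng.normal(size=(10, 16))             # projects to k = 16 bits
b = np.zeros(16)

relaxed = hash_layer(text_features, W, b)
codes = binarize(relaxed)
```

The same head applies unchanged to the attention-aware and the inattention-aware features, giving one code for each.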

3.3 Hashing Objectives

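Our objective contains two types of terms: 1) the cross-modal retrieval loss, which learns to keep the similarities between different modalities of data, and 2) the adversarial retrieval loss, which drives the generation of the attention distribution.

3.3.1 Cross-modal Retrieval Loss

The aim of the cross-modal retrieval loss is to keep the similarities between images and texts. To keep the semantic similarities, an inter-modal ranking loss and an intra-modal ranking loss are used, following [8]. That is, hash codes from different modalities should preserve semantic similarity, and hash codes from the same modality should also preserve semantic similarity.

The cross-modal retrieval loss can be formulated as

$$L_{cmr} = L_{T \to I} + L_{I \to T} + L_{T \to T} + L_{I \to I},$$

where $X \to Y$ denotes that modality $X$ is taken as the query to retrieve the relevant data of modality $Y$, with $X \in \{I, T\}$ and $Y \in \{I, T\}$. For example, $T \to I$ means that text queries are used to retrieve relevant images. We denote $L_{cmr}$ as the similarity-preserving loss, and $L_{X \to Y}$ is the loss function for modality $X$ as the query and modality $Y$ as the database. The first two terms preserve the semantic similarity between different modalities, and the last two terms preserve the similarity within each modality.

We take $L_{T \to I}$ as an example for illustration. Given the binary code $h^T_i$ of the $i$-th text, good hash functions should ensure that similar images rank ahead of dissimilar images. That is, with $h^I_j$ denoting the code of the $j$-th image, we should have $\|h^T_i - h^I_j\| < \|h^T_i - h^I_k\|$ in Hamming space when the $j$-th image is similar to the $i$-th text and the $k$-th image is not. Formally, $L_{T \to I}$ can be defined as

$$L_{T \to I} = \sum_{\langle i, j, k \rangle} \max\left(0,\; m + \|h^T_i - h^I_j\|^2 - \|h^T_i - h^I_k\|^2\right),$$

where $\langle i, j, k \rangle$ is the triplet form and $m$ is a margin. This objective is the triplet ranking loss [26], which has shown its effectiveness in uni-modal retrieval.

Similarly, $L_{I \to T}$ can be defined as

$$L_{I \to T} = \sum_{\langle i, j, k \rangle} \max\left(0,\; m + \|h^I_i - h^T_j\|^2 - \|h^I_i - h^T_k\|^2\right).$$

$L_{T \to T}$ can be formulated as

$$L_{T \to T} = \sum_{\langle i, j, k \rangle} \max\left(0,\; m + \|h^T_i - h^T_j\|^2 - \|h^T_i - h^T_k\|^2\right),$$

and $L_{I \to I}$ is

$$L_{I \to I} = \sum_{\langle i, j, k \rangle} \max\left(0,\; m + \|h^I_i - h^I_j\|^2 - \|h^I_i - h^I_k\|^2\right).$$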
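A triplet ranking loss of the kind referred to above can be sketched in NumPy as follows. This is a generic hinge-style formulation under stated assumptions (squared Euclidean distance on relaxed codes, and a margin of 1.0 chosen for the example), not necessarily the exact loss used by the authors; the function name and the toy codes are hypothetical.

```python
import numpy as np

def triplet_ranking_loss(query, positive, negative, margin=1.0):
    """Hinge-style triplet ranking loss summed over a batch of triplets.

    Each row i forms one triplet (query[i], positive[i], negative[i]); the
    loss is zero once the positive is closer than the negative by `margin`.
    """
    d_pos = np.sum((query - positive) ** 2, axis=1)
    d_neg = np.sum((query - negative) ** 2, axis=1)
    return np.maximum(0.0, margin + d_pos - d_neg).sum()

# Toy text query code with one similar and one dissimilar image code.
q = np.array([[1.0, -1.0, 1.0, -1.0]])
pos = np.array([[1.0, -1.0, 1.0, 1.0]])    # differs from q in one bit
neg = np.array([[-1.0, 1.0, -1.0, 1.0]])   # differs from q in every bit

loss_good = triplet_ranking_loss(q, pos, neg)  # correct ranking
loss_bad = triplet_ranking_loss(q, neg, pos)   # ranking reversed
```

The inter-modal and intra-modal terms differ only in which modality supplies the query codes and which supplies the database codes.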


3.3.2 Adversarial Retrieval Loss

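Inspired by the impressive results of the generative adversarial network (GAN) in image generation, we adopt it for generating the attention distribution. Similar to GAN, our method also has two models: the generative attention model $G$ and the discriminative hashing model $D$. The model $D$ aims to preserve the semantic similarity between different modalities, while $G$ tries to generate the attention distribution as described in Subsection 3.2.2. The inattention-aware features from $G$ should make $D$ fail to keep the semantic similarities. Hence, the adversarial loss can be expressed as

$$L_{adv} = L_{T \to I}\big(h^T, \hat{h}^I\big) + L_{I \to T}\big(h^I, \hat{h}^T\big),$$

where $L_{X \to Y}(q, d)$ denotes the triplet ranking loss of Subsection 3.3.1 evaluated with query codes $q$ from modality $X$ and database codes $d$ from modality $Y$, and $\hat{f}^I$ and $\hat{f}^T$ are the generated inattention-aware features. Note that $\hat{h}^I = D^I(\hat{f}^I)$ and $\hat{h}^T = D^T(\hat{f}^T)$. $G$ tries to maximize this loss, while $D$ tries to minimize it.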


3.3.3 Full Objective

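Our full objective is

$$\min_{D} \max_{G} \; L_{cmr} + L_{adv},$$

where $L_{cmr}$ is the cross-modal retrieval loss and $L_{adv}$ is the adversarial retrieval loss. Similar to GAN, we train our model alternately. First, the parameters in $G^I$ and $G^T$ are fixed and the other parameters are trainable:

$$\min_{D^I, D^T} \; L_{cmr} + L_{adv}.$$

Then the parameters outside $G^I$ and $G^T$ are fixed and the generative attention models are updated:

$$\max_{G^I, G^T} \; L_{adv}.$$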
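The alternating update scheme can be illustrated with a small schedule helper that decides which player is updated at each iteration. The default ratio of four discriminator steps to one generator step mirrors the step counts given in the experimental settings, but which network takes the four steps is an assumption of this sketch, and the function name is hypothetical.

```python
def alternating_schedule(num_iterations, d_steps=4, g_steps=1):
    """Return which player ('D' or 'G') is updated at each iteration.

    Each period consists of `d_steps` discriminator updates (generator
    frozen) followed by `g_steps` generator updates (discriminator frozen).
    """
    period = d_steps + g_steps
    return ["D" if it % period < d_steps else "G" for it in range(num_iterations)]

schedule = alternating_schedule(10)
```

In a training loop, a "D" entry would correspond to a minimization step on the discriminative hashing networks and a "G" entry to a maximization step on the generative attention networks.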


4 Experiments

Task: Text Query, Image Database

Method   IAPR TC-12                   MIR-Flickr 25k               NUS-WIDE
         16 bits  32 bits  64 bits    16 bits  32 bits  64 bits    16 bits  32 bits  64 bits
CCA      0.3493   0.3438   0.3378     0.5742   0.5713   0.5691     0.3731   0.3661   0.3613
CMFH     0.4168   0.4212   0.4277     0.6365   0.6399   0.6429     0.5031   0.5187   0.5225
SCM      0.3453   0.3410   0.3470     0.6939   0.7012   0.7060     0.5344   0.5412   0.5484
STMH     0.3687   0.3897   0.4044     0.6074   0.6153   0.6217     0.4471   0.4677   0.4780
SePH     0.4423   0.4562   0.4648     0.7216   0.7261   0.7319     0.5983   0.6025   0.6109
DCMH     0.5185   0.5378   0.5468     0.7827   0.7900   0.7932     0.6389   0.6511   0.6571
Ours     0.5358   0.5565   0.5648     0.7922   0.8062   0.8074     0.6708   0.6875   0.6939

Task: Image Query, Text Database

Method   IAPR TC-12                   MIR-Flickr 25k               NUS-WIDE
         16 bits  32 bits  64 bits    16 bits  32 bits  64 bits    16 bits  32 bits  64 bits
CCA      0.3422   0.3361   0.3300     0.5719   0.5693   0.5672     0.3742   0.3667   0.3617
CMFH     0.4189   0.4234   0.4251     0.6377   0.6418   0.6451     0.4900   0.5053   0.5097
SCM      0.3692   0.3666   0.3802     0.6851   0.6921   0.7003     0.5409   0.5485   0.5553
STMH     0.3775   0.4002   0.4130     0.6132   0.6219   0.6274     0.4710   0.4864   0.4942
SePH     0.4442   0.4563   0.4639     0.7123   0.7194   0.7232     0.6037   0.6136   0.6211
DCMH     0.4526   0.4732   0.4844     0.7410   0.7465   0.7485     0.5903   0.6031   0.6093
Ours     0.5293   0.5283   0.5439     0.7563   0.7719   0.7720     0.6300   0.6258   0.6468

Table 1: Comparison of MAP on two cross-modal retrieval tasks w.r.t. different lengths of hash codes.

In this section, we evaluate the performance of our proposed method on three datasets and compare it with several state-of-the-art algorithms.

4.1 Datasets

We choose three representative public datasets for evaluation: IAPR TC-12 [27], MIR-Flickr 25k [28] and NUS-WIDE [29].

IAPR TC-12 is a popular dataset for cross-modal retrieval. It consists of 20,000 still natural images collected from a wide range of domains, with at least one sentence description for each image. The image-text pairs are multi-label, and 255 concept categories are used as the ground-truth labels. In our experiment, we use the whole dataset. For the image modality, we use the raw pixels directly, and for each text sample, we convert the sentence descriptions into 2,912-dimensional bag-of-words vectors.

MIR-Flickr 25k includes 25,000 multi-label images downloaded from the photo-sharing website Flickr.com. The textual description of each image consists of several words. Each instance holds one or more labels among 24 concept categories. In our experiment, we first remove the textual words that appear fewer than 20 times, and then delete the image-text pairs that lack textual words or labels from the original dataset. This leaves 20,015 instances. For the image modality, we use raw pixels as before, while 1,386-dimensional bag-of-words vectors are used to represent the texts.

NUS-WIDE is a widely used dataset for cross-modal retrieval, which consists of 269,648 multi-label images. As in MIR-Flickr, the textual representation of each image consists of several associated words. There are 81 concept categories provided for evaluation. In our experiment, we choose the image-text pairs that belong to the 21 most frequent labels, with a vocabulary of 1,000 textual words, which yields 195,834 pairs. For the image modality, we again use raw pixels, and 1,000-dimensional bag-of-words vectors are used for the text modality.

To establish the training and test sets, we randomly choose 2,000 image-text pairs from each of the IAPR TC-12 and MIR-Flickr datasets as the test sets, i.e., the query sets. The remaining instances form the retrieval sets, from which 10,000 random samples are selected as the training sets. For the NUS-WIDE dataset, we select 2,100 image-text pairs as the test (query) set. The rest constitutes the retrieval set, and 10,500 random instances from the retrieval set form the training set. Table 2 shows the number of samples in each set.

    IAPR TC-12   MIR-Flickr   NUS-WIDE  
  #Train   10000   10000   10500  
  #Test   2000   2000   2100  
  #Retrieval   18000   18015   193734  
Table 2: The number of samples in each dataset.
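The split protocol summarized in Table 2 can be sketched as follows, using the MIR-Flickr sizes as an example. This is an illustrative sketch: the random seed, the function name and the use of index arrays are assumptions, and the actual random split used in the experiments is not specified in the text.

```python
import numpy as np

def split_indices(num_samples, num_query, num_train, seed=0):
    """Randomly split sample indices into query, retrieval and training sets.

    The query (test) set is drawn first, the remaining samples form the
    retrieval set, and the training set is a random subset of the retrieval set.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_samples)
    query = perm[:num_query]
    retrieval = perm[num_query:]
    train = rng.choice(retrieval, size=num_train, replace=False)
    return query, retrieval, train

# MIR-Flickr sizes from Table 2: 20,015 instances, 2,000 queries, 10,000 training samples.
query, retrieval, train = split_indices(20015, 2000, 10000)
```

Note that the training set is contained in the retrieval set, while the query set is disjoint from both.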

4.2 Experimental Settings And Evaluation Measures

(a) Text-to-image query task.
(b) Image-to-text query task.
Figure 7: Precision-recall curves on three datasets. The length of the hash codes is 16.

We implement our code based on the open-source Caffe [30] framework. In training, the networks are updated alternately with a stochastic gradient solver, i.e., Adam. We alternate between 4 steps of optimizing $D$ and 1 step of optimizing $G$. We initialize VGGNet with weights pre-trained on the ImageNet dataset, except for the last layer. For the text modality, all parameters are randomly initialized. The batch size is 64 and the total number of epochs is 100. The base learning rate is 0.005 and it is reduced to one tenth of its current value every 20 epochs. In testing, we use only the attention-aware features, i.e., the foregrounds, of the data to construct the binary codes.

All samples are ranked according to their Hamming distances to the query. To evaluate the performance of the hashing models, we use two metrics: mean average precision (MAP) [32] and precision-recall curves. MAP is a standard evaluation metric for information retrieval, defined as the mean of the average precision over a set of queries.
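The evaluation protocol above (Hamming ranking followed by MAP) can be sketched as below. This is one common way to compute MAP over the full ranked list; the exact variant used in the experiments (for example, whether the list is truncated) is not stated, so the details here are assumptions. Codes are assumed to take values in {-1, +1}, and all names are hypothetical.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, relevance):
    """MAP under Hamming ranking.

    relevance[q, d] is 1 if database item d is relevant to query q, else 0.
    Queries without any relevant item are skipped.
    """
    k = query_codes.shape[1]
    dist = (k - query_codes @ db_codes.T) / 2          # pairwise Hamming distances
    ap_values = []
    for q in range(query_codes.shape[0]):
        order = np.argsort(dist[q], kind="stable")     # nearest items first
        rel = relevance[q, order]
        if rel.sum() == 0:
            continue
        hits = np.cumsum(rel)                          # relevant items seen so far
        ranks = np.arange(1, rel.size + 1)
        ap_values.append(np.sum(rel * hits / ranks) / rel.sum())
    return float(np.mean(ap_values))

query_codes = np.array([[1, 1, 1, 1]])
db_codes = np.array([[1, 1, 1, 1], [1, 1, 1, -1], [1, 1, -1, -1], [-1, -1, -1, -1]])
relevance = np.array([[1, 0, 1, 0]])

map_score = mean_average_precision(query_codes, db_codes, relevance)
```

In this toy case the relevant items appear at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2.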

Figure 8: Some image and mask samples. The first row shows the original images, the middle row shows the masks, and the bottom row shows their combinations.

4.3 Comparison with State-of-the-art Methods

Six state-of-the-art cross-modal hashing approaches are selected as our baselines: CCA [33], CMFH [4], SCM [3], STMH [34], SePH [5] and DCMH [6].

The comparison results of search accuracy on all three datasets are shown in Table 1. From the table we can see that our method outperforms the other baselines. For example, on IAPR TC-12 with 16-bit codes and text queries, the MAP of our method is 0.5358 compared to 0.5185 for the second best algorithm, DCMH. The precision-recall curves are shown in Figure 7. It can be seen that our method compares favorably with the existing baselines.

We also explore the effect of a smaller network architecture in the feature learning module for the image modality, since VGGNet is a large deep network. In this experiment, we select CNN-F [35] as the basic network for the image modality. The comparison results are shown in Table 3. We can see that VGGNet performs better than CNN-F, while our method with CNN-F still achieves good performance compared to the other state-of-the-art baselines.

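Task                         Network   16 bits  32 bits  64 bits
Text Query, Image Database   VGG       0.5358   0.5565   0.5648
Text Query, Image Database   CNN-F     0.5267   0.5459   0.5538
Image Query, Text Database   VGG       0.5293   0.5283   0.5439
Image Query, Text Database   CNN-F     0.5211   0.5168   0.5208
Table 3: MAP on the IAPR TC-12 dataset with different networks.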

The main reason for the good performance of our method is that it learns an attention distribution for the multi-modal data. Figure 8 shows some examples for the image modality. Note that it is hard to visualize the text modality (the text networks use fully-connected layers instead of a CNN, and the order of words is not preserved in the BOW representation), so we do not show the masks learned by the text network.

5 Conclusion

In this paper, we proposed a novel approach called HashGAN for cross-modal hashing based on the idea of adversarial architecture. The proposed HashGAN contains three major components: a feature learning module, a generative attention module and a discriminative hashing module. The feature learning module learns powerful representations for multi-modal data. The generator and the discriminator play a two-player minimax game, in which the discriminator tries to minimize the similarity-preserving loss functions while the generator aims to maximize the retrieval loss of the inattention-aware features. We evaluated our method on three datasets, and the experimental results demonstrate its appealing performance.