SSAH: Semi-supervised Adversarial Deep Hashing with Self-paced Hard Sample Generation

by Sheng Jin, et al.
Harbin Institute of Technology

Deep hashing methods have proven effective and efficient for large-scale Web media search. The success of these data-driven methods largely depends on collecting sufficient labeled data, which is usually a crucial limitation in practical cases. Current solutions to this issue use Generative Adversarial Networks (GANs) to augment data in semi-supervised learning. However, existing GAN-based methods treat image generation and hashing learning as two isolated processes, which makes generation ineffective. Moreover, most works fail to exploit the semantic information in unlabeled data. In this paper, we propose a novel Semi-supervised Self-paced Adversarial Hashing method, named SSAH, to solve the above problems in a unified framework. The SSAH method consists of an adversarial network (A-Net) and a hashing network (H-Net). To improve the quality of generated images, first, the A-Net learns hard samples with multi-scale occlusions and multi-angle rotated deformations, which compete against the learning of accurate hashing codes. Second, we design a novel self-paced hard generation policy to gradually increase the hashing difficulty of generated samples. To make use of the semantic information in unlabeled data, we propose a semi-supervised consistent loss. The experimental results show that our method can significantly improve on state-of-the-art models on both the widely used hashing datasets and fine-grained datasets.





In the big data era, large-scale image retrieval is widely used in many practical applications, yet it remains a challenge because of the large computational cost and high accuracy requirement. To address the efficiency and effectiveness issues, hashing methods have become a hot research topic. A great number of hashing methods have been proposed to map images into a Hamming space, including traditional hashing methods [1, 18, 17] and deep hashing methods [5, 4, 19, 28]. Compared with traditional ones, deep hashing methods usually achieve better retrieval performance due to their powerful feature representation and nonlinear mapping.

Figure 1: To obtain the optimal boundary for points with similar hashing codes, we propose a novel self-paced deep adversarial hashing to generate hard samples, as shown in (b). Intuitively, these samples can help the network learn more optimal classification boundaries.

Although great efforts have been devoted to deep learning-based algorithms, their label-hungry nature makes them hard to apply in practice. In contrast, for some retrieval tasks, unlabeled data is abundant. To make use of unlabeled data, several semi-supervised methods have been proposed, including graph-based methods like SSDH [37] and BGDH [36], and generation methods like DSH-GANs [25], HashGAN [3] and SSGAH [32]. Graph-based works like SSDH and BGDH use a graph structure to mine unlabeled data. However, constructing the graph model for large-scale data is computationally expensive and time-consuming, and using batch data instead may lead to a suboptimal result. More recently, GANs [8] have proven effective in generation tasks and have been introduced into hashing. Existing GAN-based methods are restricted by two crucial problems, i.e., generation ineffectiveness and unlabeled-data underutilization.

In terms of generation ineffectiveness, existing GAN-based methods train the generation network solely based on label information. This setting leads to ineffective generations that are either too hard or too easy for hashing code learning, and thus unable to match the dynamic training of the hashing network. In terms of unlabeled-data underutilization, most existing works like DSH-GANs [25] only exploit unlabeled data to synthesize high-quality images, while the unlabeled data is not utilized when learning hashing codes. We argue that the above two issues are not independent. In particular, the ineffective generation policy prevents triplet-wise methods like SSGAH [32] from making the most of unlabeled data, since these algorithms heavily depend on hard triplets.

In this paper, we propose a novel deep hashing method, termed semi-supervised self-paced deep adversarial hashing (SSAH), as a solution to both generation ineffectiveness and unlabeled-data underutilization. The main idea of SSAH is depicted in Figure 1.

To tackle generation ineffectiveness, first, our method generates suitably hard images to gradually improve hashing learning. The generation network is designed to produce hard samples (we define samples that are difficult for the current retrieval model as hard samples) with multi-scale masks and multi-angle rotations. Second, we apply the key idea of self-paced learning (SPL) to our framework to control the hard generations during the dynamic training procedure; the SPL theory [15] is inspired by the human learning process, where samples are introduced from easy to gradually more complex ones. To tackle unlabeled-data underutilization, we propose a consistent loss that encourages consistent binary codes for each input image (both labeled and unlabeled) and its corresponding hard generations.

Specifically, SSAH consists of an adversarial generation network (A-Net) and a hashing network (H-Net). The loss function contains four components: a self-paced adversarial loss, a semi-supervised consistent loss, a supervised semantic loss, and a quantization loss, which guide the training of the two networks in an adversarial manner. The A-Net learns deformations and masks to generate hard images, and the quality of these generated images is evaluated by the H-Net. The H-Net is then trained on both the input images and the generated hard samples. In the test phase, we only use the H-Net to produce hashing codes.

The main contributions of SSAH are three-fold:

  • To generate samples properly, we propose a novel hashing framework by integrating the self-paced adversarial mechanism into hard generations and hashing codes learning. Our generation network takes both masking and deformation into account.

  • To make better use of unlabeled data, a novel consistent loss is proposed to exploit semantic information of all the data in a semi-supervised manner.

  • Experimental results on both general and fine-grained datasets demonstrate the superior performance of our method in comparison with many state-of-the-art methods.

Related Work

We introduce the most related works from two aspects: Semi-supervised Hashing and Hard Example Generation.

Figure 2: An illustration of self-paced adversarial hashing (SSAH). SSAH is comprised of two main components: (1) an adversarial network (A-Net) for hard sample generation, (2) a hashing network (H-Net) based on AlexNet for learning hashing codes. In step 1, the A-Net takes the training images and their pairwise similarity as input to learn a hard mask. This mask is learned under the criterion that the hashing codes of masked image pairs become contrary to their pairwise label. In step 2, the H-Net takes both the training images and the generated hard samples as input to learn more accurate binary codes. In the training stage, hard generation and hashing codes are learned in an adversarial way.

Semi-supervised Hashing. Based on whether labeled data is used in the training process, hashing methods can be divided into unsupervised [21], semi-supervised [36], and supervised ones [35]. Semi-supervised hashing is effective when a small amount of labeled data and plenty of unlabeled data are available. SSH [33] minimizes the empirical error over labeled data and maximizes the information entropy of binary codes over both labeled and unlabeled data. However, SSH is a traditional shallow method, which leads to unsatisfying performance compared with deep hashing methods like SSDH [37] and BGDH [36], which are discussed in the introduction.

Very recently, some semi-supervised hashing methods that use GANs [8] to augment data have been proposed. Deep Semantic Hashing (DSH-GANs) [25] is the first hashing method that introduces GANs into hashing. However, it can only incorporate pointwise labels, which are often unavailable in online image retrieval applications. Cao et al. propose a novel conditional GAN based on pairwise supervised information, named HashGAN [3], to solve the insufficient-sample problem. However, the sample generation is independent of hashing code learning. Wang et al. [32] propose SSGAH, which utilizes triplet labels and specifically designs a GAN model that can be learned well with limited supervised information. For cross-modal hashing, Zhang et al. [38] mine the attention region in an adversarial way. However, all the mentioned GAN-based methods try to generate as many images as possible, which is neither effective nor even feasible in most cases.

Hard Example Generation.

Hard example generation is currently used to train deep models effectively for many computer vision tasks, including object detection [34], retrieval [10], and other tasks [23]. Zhong et al. [39] propose a parameter-learning-free method, termed Random Erasing, which randomly selects a rectangular region in an image and erases its pixels with random values. However, the hard example generation of Random Erasing is still isolated from network training, so the generated images may not be consistent with the dynamic training status. Very recently, Wang et al. [34] introduce an adversarial mechanism to synthesize hard samples. This method incorporates pointwise supervised information, e.g. class labels, which is often unavailable in online image retrieval applications. Different from previous methods, we propose a novel architecture that uses the pairwise label to generate hard samples for learning better hashing codes. Moreover, our proposed generation networks learn hard samples in a self-paced adversarial manner.

Semi-supervised Self-paced Adversarial Hash

Consider a dataset consisting of labeled and unlabeled data. Since each labeled image in the dataset has a unique class label, image pairs can be further labeled as similar or dissimilar, where $s_{ij}$ denotes the pairwise label of the image pair ($x_i$, $x_j$). Our goal is to learn more accurate hashing codes in a semi-supervised way. To this end, we present Semi-supervised Self-paced Adversarial Hashing (SSAH) in detail. Since discrete optimization is difficult to solve with deep networks, we first ignore the binary constraint and concentrate on the binary-like code $h$. We then obtain the binary code $b$ from $h$. Figure 2 illustrates an overview of our architecture, which consists of an A-Net for hard sample generation and an H-Net for binary code learning. The generation network takes labeled and unlabeled images as inputs and produces hard masks and rotation-related parameters as outputs. The H-Net then learns compact binary hashing codes from both the generated hard samples and the training images.
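The relaxation from binary to binary-like codes can be illustrated with a small sketch. It assumes that the binary code is obtained by taking the sign of each entry of the binary-like code, a common convention in deep hashing that the text here does not spell out; the helper names `binarize` and `hamming_distance` are hypothetical.

```python
def binarize(h):
    """Map a real-valued, binary-like code to a {-1, +1} code by its sign."""
    # Assumption: zero is mapped to +1 so that every bit is well defined.
    return [1 if v >= 0 else -1 for v in h]


def hamming_distance(b1, b2):
    """Count the positions where two binary codes differ."""
    if len(b1) != len(b2):
        raise ValueError("codes must have the same length")
    return sum(1 for u, v in zip(b1, b2) if u != v)


# Toy binary-like outputs of a hashing layer (4 bits).
h_query = [0.9, -0.7, 0.2, -0.1]
h_item = [0.8, 0.6, 0.3, -0.4]
b_query, b_item = binarize(h_query), binarize(h_item)
print(b_query, b_item, hamming_distance(b_query, b_item))
```

At retrieval time only the binary codes are compared, which is why the Hamming distance is the quantity of interest.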

Adversarial Network

The A-Net is designed to generate hard images in two ways. The first is to change the rotation angle and scale of the whole image; for this we propose the Multi-Angle Rotation Network, termed MARN. The second is to generate masks that change pixel values; for this we propose the Multi-Scale Mask Network, termed MSMN.

Multi-Angle Rotation Network. Motivated by STN [11], we propose the MARN to create multi-angle rotations of the training images. The reason is that we want to preserve the location correspondence between image pixels and landmark coordinates; otherwise, we might hurt the localization accuracy once the intermediate feature maps are disturbed. However, a single-angle rotation network will often predict the largest allowed angle. To improve the diversity of generated images, the MARN is designed to produce $n$ hard generations per input image, each required to lie in a different range of angles. More specifically, the rotation of the first generated image is constrained to a fixed range, both clockwise and anticlockwise, and the $k$-th generated image is constrained to its own range in the same way. For each input image, the MARN thus produces $n$ hard generated images.
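As a rough illustration of rotation parameters restricted to per-sample angle ranges, the sketch below squashes an unconstrained prediction into a band and builds the 2x3 affine matrix that spatial transformers consume. The 10-degree step, the widening of the band with the sample index, and the tanh squashing are all assumptions made for illustration; the actual ranges used by the MARN are not recoverable from the text.

```python
import math


def clamp_angle(raw, k, step_deg=10.0):
    """Squash an unconstrained prediction into the k-th angle band (k = 1, 2, ...)."""
    # tanh keeps the sign (rotation direction) and bounds the magnitude by k * step_deg.
    return math.tanh(raw) * k * step_deg


def rotation_affine(angle_deg):
    """2x3 affine matrix of a pure rotation about the image centre."""
    t = math.radians(angle_deg)
    return [[math.cos(t), -math.sin(t), 0.0],
            [math.sin(t), math.cos(t), 0.0]]


# Hypothetical raw outputs of a rotation network, one per generated sample.
raw_predictions = [2.0, -0.5, 1.0]
angles = [clamp_angle(r, k) for k, r in enumerate(raw_predictions, start=1)]
matrices = [rotation_affine(a) for a in angles]
print(angles)
```

Bounding each prediction separately is one way to keep the generated samples diverse instead of letting every output drift to the largest angle.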

Multi-Scale Mask Network. The objective of MSMN is to produce multi-scale masks that make training images harder. In the hashing learning stage, we can obtain convolutional features from different layers of the H-Net. These features represent regions of the original image at different spatial scales. Correspondingly, we generate multi-scale additive and multiplicative masks for these selected features.

The framework of MSMN is shown in Figure 3. Specifically, for each selected convolutional layer $l$, we extract its feature map, whose size is determined by the spatial dimension and the number of channels. Given this feature, our MSMN predicts an additive hard mask and a multiplicative hard mask.

We use the sigmoid function as the activation function for additive masks and the tanh function for multiplicative masks, so the values of an additive mask lie in $(0, 1)$ and those of a multiplicative mask lie in $(-1, 1)$. The corresponding features of the hard samples are obtained by applying both masks to the extracted features, as given in Eq (1).

When $l$ is 0, the feature map is the input image itself.
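The effect of the two mask types on a feature map can be sketched as follows. The activations follow the text (sigmoid for the additive mask, tanh for the multiplicative mask), but the rule that combines them with the features is an assumption chosen for illustration, since Eq (1) is not reproduced here; `apply_masks` is a hypothetical helper operating on a flattened feature map.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def apply_masks(features, add_logits, mul_logits):
    """Perturb a flat list of feature values with an additive and a multiplicative mask."""
    hard = []
    for f, a, m in zip(features, add_logits, mul_logits):
        add_mask = sigmoid(a)      # additive mask value, in (0, 1)
        mul_mask = math.tanh(m)    # multiplicative mask value, in (-1, 1)
        # Assumed combination rule (not taken from the paper): scale, then shift.
        hard.append(f * (1.0 + mul_mask) + add_mask)
    return hard


features = [2.0, -1.0, 0.5]
print(apply_masks(features, add_logits=[0.0, 1.0, -1.0], mul_logits=[0.0, 0.5, -0.5]))
```

With zero logits the multiplicative mask leaves the feature unchanged and the additive mask shifts it by 0.5, which makes the two roles easy to tell apart.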

Hashing Learning Network

We directly adopt a pre-trained AlexNet [14] as the base model of the H-Net. The raw image pixels, from both the original images and the generated images, are the input of the hashing model. The learned additive and multiplicative masks are only required when encoding the generated images. The hard feature maps are computed according to Eq (1), and the hashing codes of each generated image are then learned from them, as given in Eq (2).


The output layer of AlexNet is replaced by a hashing layer where the dimension is defined based on the length of the required hashing code.

Figure 3: The framework of MSMN, which predicts multi-scale masks for the selected layers.

Loss Functions

In this section, we design the adversarial loss function, including a self-paced adversarial loss and a supervised semantic loss, for the A-Net to guide the generations of hard samples. Besides, a hashing loss function for the H-Net, including the supervised semantic loss and the semi-supervised consistent loss, is proposed to learn better codes by capturing semantic information of all data.

Self-paced Adversarial Loss. Most existing GAN-based hashing methods try to generate images to augment data. However, these methods cannot ensure the quality of the generated samples and may produce bad samples that are (1) too difficult or too easy, or (2) unable to match the dynamic training.

To improve the effectiveness of generations, we leverage the H-Net to guide the training of the A-Net through a novel quantity termed the hard degree. First, we compute the similarity degree of an image pair ($x_i$, $x_j$) from the binary-like codes of the two input images. The distance between the pairwise label and this estimated similarity degree is denoted as $d_{ij}$ (Eq (3)).

The distance $d'_{ij}$ is obtained in the same way from the binary-like codes of the corresponding generated images.

Since the hard samples learned by the A-Net should increase the H-Net loss, the value of $d'_{ij}$ is required to be larger than that of $d_{ij}$. The hard degree of a generated image pair is defined as the difference between $d'_{ij}$ and $d_{ij}$, as given in Eq (4).


We adopt a positive dynamic margin to define the self-paced adversarial loss in Eq (5), where the margin parameter is a fixed constant.

Discussion. The self-paced adversarial loss has two merits: it considers inter-pair differences and it generates hard samples gradually. In terms of inter-pair differences, the distance of an original image pair is large when the pair is hard to distinguish. In this situation, the hard degree of the corresponding generated pair does not need to be large. By contrast, if the original pair is easy to distinguish, the hard degree of the generated pair needs to be relatively large. To meet these different requirements, the dynamic margin adapts the required hard degree to each pair. In terms of the hard generation policy, as the H-Net is trained, better codes are learned and the distance of the original pairs gradually becomes small. The dynamic margin then becomes larger, leading to harder generations.
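The behaviour described in this discussion can be reproduced with a small numeric sketch. It is not the paper's exact formulation: the similarity degree is assumed to be a cosine similarity rescaled to [0, 1], the dynamic margin is assumed to be the fixed constant (called `eta` here) minus the distance of the original pair, and the loss is assumed to be a hinge on the hard degree. All function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pair_distance(h_i, h_j, s_ij):
    """Gap between the pairwise label (1 similar, 0 dissimilar) and the estimated similarity."""
    similarity = 0.5 * (cosine(h_i, h_j) + 1.0)   # assumed similarity degree in [0, 1]
    return abs(s_ij - similarity)


def self_paced_adversarial_loss(d_orig, d_hard, eta):
    """Hinge on the hard degree with a margin that grows as the original pair gets easier."""
    hard_degree = d_hard - d_orig                 # difference of the two distances
    margin = max(eta - d_orig, 0.0)               # assumed form of the dynamic margin
    return max(0.0, margin - hard_degree)


# An easy original pair (small distance) demands a larger hard degree than a hard one.
print(self_paced_adversarial_loss(d_orig=0.1, d_hard=0.2, eta=0.5))
print(self_paced_adversarial_loss(d_orig=0.4, d_hard=0.5, eta=0.5))
```

Under these assumptions the same hard degree of 0.1 is penalized for the easy pair but is already sufficient for the hard pair, which mirrors the inter-pair argument above.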

Similarly, we form cross-domain image pairs, each consisting of an input image and a generated hard sample. The hard degree of a cross-domain pair is defined in Eq (6).

The self-paced adversarial loss of cross-domain pairs is then defined in the same way, with the same constant as in Eq (5). The whole self-paced adversarial loss is given in Eq (7).


Supervised Semantic Loss. Similar to other hashing methods, we adopt a pairwise semantic loss to ensure that the hashing codes preserve relevant class information. The similarity degree of an image pair is computed from its binary-like codes and is required to be close to the pairwise label. Since the H-Net is trained on both labeled images and their corresponding hard generations, the supervised semantic loss covers both, as given in Eq (8).


Besides, when training the A-Net, the supervised semantic loss is also adopted as a regularization term to preserve the class information of the generated hard samples.

Semi-supervised Consistent Loss. In some related computer vision tasks like semi-supervised classification, pseudo-ensemble methods [2, 16, 29] have proven effective. These methods stem from human cognitive ability: when a percept is changed slightly, a human typically still considers it to be the same object. They encourage consistent network outputs for each image with and without noise. Motivated by the success of these works, we propose a novel consistent loss to improve the utilization of unlabeled data.

However, whereas existing pseudo-ensemble methods typically adopt random, data-independent noise, our method uses the A-Net to generate more suitable noise for the inputs, including multi-scale masks and multi-angle rotations. The H-Net is then required to learn consistent binary codes for the training images (both labeled and unlabeled) and their corresponding hard samples, by penalizing the Hamming distance between the hashing codes of each image and those of its hard sample. The consistent loss is given in Eq (9).


Quantization Loss. Because the binary constraint is ignored during training, a widely used quantization loss pulls the binary-like codes and the binary codes together, as given in Eq (10).
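A minimal sketch of the consistent and quantization terms is given below. Both forms are assumptions: the consistent term is written as a squared-difference relaxation of the Hamming distance between the code of an image and the code of its hard sample, and the quantization term as the squared gap between a binary-like code and its sign. Neither is claimed to match Eq (9) or Eq (10) exactly, and the function names are hypothetical.

```python
def sign(v):
    return 1.0 if v >= 0 else -1.0


def consistent_loss(h, h_hard):
    """Relaxed Hamming distance between the codes of an image and of its hard sample."""
    # For exact +/-1 codes, sum((a - b)^2) / 4 equals the number of differing bits.
    return sum((a - b) ** 2 for a, b in zip(h, h_hard)) / 4.0


def quantization_loss(h):
    """Squared gap between a binary-like code and its binarized version."""
    return sum((v - sign(v)) ** 2 for v in h)


h_image = [0.9, -0.8, 0.7]   # binary-like code of a (possibly unlabeled) image
h_hard = [0.6, 0.1, 0.8]     # binary-like code of its generated hard sample
print(consistent_loss(h_image, h_hard), quantization_loss(h_image))
```

Note that the consistent term needs no label at all, which is what lets unlabeled images contribute to training the hashing network.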

Algorithm 1 Self-paced Deep Adversarial Hashing
Input:  Training set and the corresponding class labels.
Output: H-Net function and A-Net function.
1:  For the entire training set, construct the pairwise label matrix.
2:  for each epoch do
3:     Compute the hashing codes according to Eq (2).
4:     Fixing the H-Net, update the A-Net according to Eq (11).
5:     Fixing the A-Net, update the H-Net according to Eq (12).
6:  end for
7:  return the H-Net and the A-Net.

Alternating Optimization

Our network consists of two sub-networks: an adversarial network, termed A-Net, for hard image generation and a hashing network, termed H-Net, for learning compact hashing codes. As shown in Algorithm 1, we train the A-Net and the H-Net iteratively. The overall training objective of the A-Net integrates the semantic loss defined in Eq (8), the self-paced adversarial loss of three types of image pairs defined in Eq (7), and the quantization loss defined in Eq (10). The A-Net is trained with the overall loss in Eq (11).


By minimizing this term, the A-Net is trained to generate proper hard samples, leading to better hashing codes.

For the shared hashing model, the H-Net is trained on the input data, including labeled and unlabeled images, and their corresponding hard generated samples. The supervised semantic loss, the semi-supervised consistent loss, and the quantization loss are used to train the H-Net. We update its parameters according to the overall loss in Eq (12).


By minimizing this term, the shared H-Net is trained to learn effective hashing codes.


To test the performance of our proposed SSAH method, we conduct experiments on general hashing datasets, i.e. CIFAR-10 and NUS-WIDE, to verify the effectiveness of our method. Then we conduct experiments on two fine-grained datasets, i.e. CUB Bird and Stanford Dogs-120, to prove that our method is still robust and effective for more complex fine-grained retrieval tasks. We also conduct some analytical experiments to further verify our method.

Method     CIFAR-10                      NUS-WIDE
           12 bits  24 bits  48 bits     12 bits  24 bits  48 bits
ITQ-CCA    0.435    0.435    0.435       0.526    0.575    0.594
KSH        0.556    0.572    0.588       0.618    0.651    0.682
SDH        0.558    0.596    0.614       0.645    0.688    0.711
CNNH       0.439    0.476    0.489       0.611    0.618    0.608
HashGAN    0.655    0.709    0.727       0.708    0.722    0.730
SSDH       0.801    0.813    0.814       0.773    0.779    0.778
BGDH       0.805    0.824    0.833       0.803    0.818    0.828
SSGAH      0.819    0.837    0.855       0.835    0.847    0.865
SSAH       0.862    0.878    0.886       0.872    0.884    0.898
Table 1: The mAP scores for different numbers of bits on the CIFAR-10 and NUS-WIDE datasets.


We conduct our experiments on two general datasets, namely CIFAR-10 and NUS-WIDE. CIFAR-10 [13] is a small image dataset containing 60k 32×32 images in 10 classes. Each image belongs to one class (6,000 images per class). NUS-WIDE [6] contains nearly 270k images with 81 semantic concepts. For NUS-WIDE, we follow [21] and use the images associated with the 21 most frequent concepts, where each of these concepts is associated with at least 5,000 images. Following [21, 32], we randomly sample 100 images per class as the test set and use the rest as the database. In the training process, we randomly sample 500 images per class from the database as labeled data and treat the others as unlabeled data.
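The per-class sampling protocol above can be sketched on label lists alone. This is a simplified illustration that assumes single-label data (as in CIFAR-10; NUS-WIDE images carry several concepts and would need extra handling), and `split_per_class` is a hypothetical helper rather than code from the paper.

```python
import random


def split_per_class(labels, n_test, n_labeled, seed=0):
    """Return (test, labeled, unlabeled) index lists, sampled class by class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    test, labeled, unlabeled = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        test += indices[:n_test]                        # held-out queries
        labeled += indices[n_test:n_test + n_labeled]   # labeled part of the database
        unlabeled += indices[n_test + n_labeled:]       # rest of the database
    return test, labeled, unlabeled


# Toy stand-in for CIFAR-10: 10 classes with 6,000 images each.
labels = [c for c in range(10) for _ in range(6000)]
test, labeled, unlabeled = split_per_class(labels, n_test=100, n_labeled=500)
print(len(test), len(labeled), len(unlabeled))
```

On these toy CIFAR-10 sizes the split yields 1,000 test images, 5,000 labeled images, and 54,000 unlabeled images, with the labeled and unlabeled parts together forming the database.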

We further verify our method on two widely used fine-grained datasets, namely Stanford Dogs-120 and CUB Bird. The Stanford Dogs-120 [22] dataset consists of 20,580 images in 120 mutually exclusive classes. Each class contains about 150 images. CUB Bird [31] includes 11,788 images in 200 mutually exclusive classes. We directly use the test sets defined in these datasets, and the training set is used as the database. In the training process, we randomly sample 50% of the images per class from the database as labeled data and treat the others as unlabeled data.

Comparative Methods and Evaluation Metrics

For the general datasets, CIFAR-10 and NUS-WIDE, we compare our method (SSAH) with the supervised deep hashing methods CNNH [35] and HashGAN [3], three semi-supervised deep hashing methods, SSDH [37], BGDH [36], and SSGAH [32], and three shallow methods, ITQ-CCA [7], KSH [20], and SDH [27]. For the fine-grained datasets, our method is further compared with DSaH [28], which is the first hashing method designed for fine-grained retrieval.

For a fair comparison between traditional and deep hashing methods, we run the traditional methods on features extracted from the fc7 layer of AlexNet pre-trained on ImageNet. For deep hashing methods, we use the original images as input and adopt AlexNet [14] as the backbone architecture.

Evaluation Metric. We use Mean Average Precision (mAP) for quantitative evaluation. The Precision@top-N curves and Precision-Recall curves are shown in the supplementary material.
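For readers unfamiliar with the metric, the following sketch computes mAP for Hamming ranking in the standard way. It assumes single-label relevance (a database item is relevant when it shares the query's class) and breaks distance ties by database order; for multi-label data such as NUS-WIDE, relevance is usually defined as sharing at least one concept. The function names are illustrative.

```python
def average_precision(ranked_relevance):
    """AP of one query, given 0/1 relevance flags in ranked order."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(query_codes, query_labels, db_codes, db_labels):
    """Rank the database by Hamming distance for each query and average the APs."""
    aps = []
    for q, q_label in zip(query_codes, query_labels):
        dists = [sum(a != b for a, b in zip(q, d)) for d in db_codes]
        order = sorted(range(len(db_codes)), key=lambda i: dists[i])
        aps.append(average_precision([int(db_labels[i] == q_label) for i in order]))
    return sum(aps) / len(aps)


db_codes = [[1, 1], [-1, -1], [1, -1]]
db_labels = [0, 1, 0]
print(mean_average_precision([[1, 1]], [0], db_codes, db_labels))
```

In the toy example both relevant items are ranked ahead of the irrelevant one, so the query attains a perfect average precision.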

Implementation Details

Network Design. As shown in Figure 2, our model consists of an A-Net, which includes the MSMN and the MARN, and an H-Net. For MSMN, we adopt a lightweight version of the generator in GANimation [24]. This network contains a stride-2 convolution, three residual blocks [9], and a 1/2-strided convolution. Similar to Johnson et al. [12], we use instance normalization [30]. The configuration of MSMN is given in the supplementary material. MARN is built upon the STN [11]. Different from STN, our transformer network produces multiple sets of affine parameters. We adopt AlexNet [14] as the encoder of the H-Net: all layers except the last one are copied from the pre-trained AlexNet and fine-tuned.

Training Details. Our SSAH is implemented in PyTorch and the deep model is trained by batch gradient descent. In the training stage, images are fed in batches, and every two images in the same batch form an image pair. In practice, we train the A-Net before the H-Net. If we trained the H-Net first, the A-Net might output semantically irrelevant hard images, which would be bad samples and would guide the training of the hashing model in the wrong direction.
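Forming image pairs inside a batch can be sketched as below. The example assumes single-label class annotations, so that a pair is labeled similar when both images share a class; `batch_pairs` is a hypothetical helper and not the authors' code.

```python
from itertools import combinations


def batch_pairs(class_labels):
    """All index pairs of a batch with a pairwise label: 1 if same class, else 0."""
    return [(i, j, int(class_labels[i] == class_labels[j]))
            for i, j in combinations(range(len(class_labels)), 2)]


print(batch_pairs([0, 1, 0]))          # three images give three pairs
print(len(batch_pairs([0] * 32)))      # number of pairs in a batch of 32
```

A batch of 32 images therefore yields 496 pairs per step, which is why pairwise supervision is dense even with small batches.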

Network Parameters. The four hyper-parameters are set to 1.0, 0.5, 0.5, and 0.1, respectively. We use mini-batch stochastic gradient descent with 0.9 momentum. The margin parameter is increased every 5 epochs. The mini-batch size is fixed at 32 and the weight decay is 0.0005. The number of rotated hard samples per image is 3.

Quantitative Results

Performance on general hashing datasets. The Mean Average Precision (mAP) results of different methods for different numbers of bits on the CIFAR-10 and NUS-WIDE datasets are shown in Table 1. SSAH outperforms the state-of-the-art SSGAH [32] by 0.031 to 0.043 mAP on CIFAR-10 and by 0.033 to 0.037 mAP on NUS-WIDE, depending on the code length. These results indicate that SSAH is more effective on the traditional hashing task.

Performance on fine-grained hashing datasets. The fine-grained retrieval task requires methods to describe fine-grained objects that share a similar overall appearance but differ subtly. Meeting this requirement increases the demand for collecting and annotating data, which in some cases requires professional knowledge. Since it is more difficult to generate fine-grained objects, GAN-based hashing methods are also not effective due to the scarcity of supervised data. However, the experimental results show that our method remains robust on this task.

The mAP results of different methods on the fine-grained datasets are shown in Table 2. The proposed SSAH method substantially outperforms all the comparison methods. Compared with the best existing retrieval performance (DSaH with AlexNet), SSAH achieves absolute mAP increases of 0.050 to 0.074 on CUB Bird and 0.029 to 0.070 on Stanford Dogs-120, depending on the code length. Compared with the mAP results on the traditional hashing task, our method achieves a significant improvement in fine-grained retrieval. Moreover, we only use the H-Net to produce binary codes in the test phase, whereas the DSaH method needs to highlight the salient regions before the encoding process. Thus, our method is also more time-efficient.

Method     Stanford Dogs-120             CUB Bird
           12 bits  24 bits  48 bits     12 bits  24 bits  48 bits
HashGAN    0.029    0.172    0.283       0.020    0.0542   0.123
SSGAH      0.127    0.233    0.329       0.073    0.1321   0.247
DSaH       0.244    0.287    0.408       0.091    0.2087   0.285
SSAH       0.273    0.343    0.478       0.141    0.265    0.359
Table 2: The mAP scores for different numbers of bits on the Stanford Dogs-120 and CUB Bird datasets.

Method     CIFAR-10                      NUS-WIDE
           12 bits  24 bits  48 bits     12 bits  24 bits  48 bits
ITQ-CCA    0.157    0.165    0.201       0.488    0.493    0.503
SDH        0.185    0.193    0.213       0.471    0.490    0.507
CNNH       0.210    0.225    0.231       0.445    0.463    0.477
DRSCH      0.219    0.223    0.251       0.457    0.464    0.360
NINH       0.241    0.249    0.272       0.484    0.483    0.487
SSDH       0.285    0.291    0.325       0.510    0.533    0.551
SSGAH      0.309    0.323    0.339       0.539    0.553    0.579
SSAH       0.338    0.370    0.379       0.569    0.571    0.596
Table 3: The mAP scores under retrieval of unseen classes on the CIFAR-10 and NUS-WIDE datasets.

Performance on Unseen Classes. To further evaluate our SSAH approach, we adopt the evaluation protocol from [26]. In the training process, 75% of the classes (termed set75) are known, and the remaining 25% of the classes (termed set25) are used for evaluation. Set75 and set25 are each further divided into a training set and a test set of equal size. Following the settings in [37], we use train75 as the training set and test25 as the test set. For general hashing retrieval, the set75 of CIFAR-10 and NUS-WIDE consist of 7 classes and 15 classes, respectively. Results are averaged over 5 different random splits, and mAP scores are calculated based on all returned images.
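The class-level split of this protocol can be sketched in a few lines. The sketch assumes that the 75% share is rounded down, which is consistent with the 7 of 10 CIFAR-10 classes and 15 of 21 NUS-WIDE concepts reported above; `split_seen_unseen` is a hypothetical helper.

```python
import random


def split_seen_unseen(classes, seen_fraction=0.75, seed=0):
    """Randomly partition class ids into seen (set75) and unseen (set25) groups."""
    rng = random.Random(seed)
    shuffled = list(classes)
    rng.shuffle(shuffled)
    n_seen = int(seen_fraction * len(shuffled))   # rounds down
    return sorted(shuffled[:n_seen]), sorted(shuffled[n_seen:])


# Five random splits, as in the protocol; each seed gives a different partition.
for seed in range(5):
    seen, unseen = split_seen_unseen(range(10), seed=seed)
    print(seed, seen, unseen)
```

Averaging the mAP over several such partitions reduces the influence of which particular classes happen to be held out.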

The mAP scores under the retrieval of unseen classes are shown in Table 3. Our SSAH method achieves the best result when retrieving unseen classes, which means that our method generalizes better to unseen classes. The experimental results on fine-grained datasets are shown in the supplementary material.

Ablation Study

To further verify our method, we conduct some analysis experiments including: (1) the effectiveness of hard samples generation, (2) the analysis of each loss component, (3) the effectiveness of self-paced generation policy.

Component Analysis of the Network. We compare our MARN and MSMN with random image rotation and random mask generation strategies in the training stage using the AlexNet architecture. (1) Random Image Rotation: for each image, we obtain three rotated images, where each image is rotated within its specified angle range. (2) Random Mask Generation: the values of the multiplicative and additive masks are drawn at random within fixed ranges.

We report our results for using MARN and MSMN in Table 4. For the AlexNet architecture, the mAP of our implemented baseline is 0.751 (12 bits) and 0.792 (48 bits) on CIFAR-10, and 0.762 and 0.801 on NUS-WIDE. Based on this setting, joint learning with our MARN model raises these scores to 0.801 and 0.831 on CIFAR-10 and to 0.810 and 0.842 on NUS-WIDE. Joint learning with the MSMN model raises them to 0.830 and 0.853 on CIFAR-10 and to 0.840 and 0.861 on NUS-WIDE. As the two methods are complementary, combining MARN and MSMN in our model gives another boost, to 0.862 and 0.886 on CIFAR-10 and to 0.872 and 0.898 on NUS-WIDE.

Method                 CIFAR-10             NUS-WIDE
                       12 bits  48 bits     12 bits  48 bits
baseline               0.751    0.792       0.762    0.801
random rotate          0.788    0.810       0.797    0.828
+MARN                  0.801    0.831       0.810    0.842
random mask            0.792    0.813       0.806    0.811
+MSMN ()               0.811    0.838       0.819    0.832
+MSMN ()               0.820    0.847       0.831    0.849
+MSMN                  0.830    0.853       0.840    0.861
random rotate+mask     0.796    0.816       0.810    0.833
Ours(full)             0.862    0.886       0.872    0.898
Table 4: The mAP scores of SSAH using different network components (MARN and MSMN).

Component Analysis of the Loss Functions. Our loss function consists of three major components: the self-paced adversarial loss, the supervised semantic loss, and the semi-supervised consistent loss. The self-paced adversarial loss includes the losses of hard image pairs and of cross-domain image pairs. To evaluate the contribution of each loss, we study the effect of different loss combinations on retrieval performance. As shown in Table 5, retrieval performance is best when the A-Net is trained with the self-paced adversarial loss and the semantic loss, while the self-paced adversarial loss is kept out of the H-Net objective. For the H-Net, the self-paced adversarial loss may disrupt the learning of accurate binary codes. Adding the consistent loss further improves performance, which shows that the H-Net captures the semantic information of unlabeled data. We also show some typical visualization results of hard masks using different loss components of the A-Net in Figure 4.

As shown in Figure 4, if the A-Net uses only the semantic loss, the generated mask avoids the object. If it uses only the self-paced adversarial loss, the generated mask occludes the object in some cases. However, combining the semantic loss and the self-paced adversarial loss yields a proper mask: the object is partially occluded but can still be recognized.

Figure 4: Typical examples of the learned masks using different loss components on a fine-grained dataset.

A-Net    H-Net    CIFAR-10
                  12 bits    24 bits    48 bits
                  0.809      0.823      0.85
                  0.821      0.843      0.862
                  0.836      0.850      0.870
                  0.848      0.862      0.872
                  0.841      0.855      0.867
                  0.862      0.878      0.886
                  0.843      0.852      0.8612
Table 5: The mAP scores of SSAH on CIFAR-10 dataset using different combinations of loss functions.
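The tables in this section all report mAP. As a reminder of how such a score is commonly computed for hashing, the Python sketch below ranks a database by Hamming distance to each query and averages the resulting average precision values. It assumes single-label data, where a database item is relevant when its label equals the query label, and it ranks the whole database. The exact evaluation protocol used for SSAH (for example, any cutoff on the ranked list) is not stated in this section, and all names are illustrative.

```python
def hamming_distance(a, b):
    """Number of positions where two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def average_precision(query_code, query_label, db_codes, db_labels):
    """AP of one query over the database ranked by Hamming distance."""
    # sorted() is stable, so ties keep their database order.
    order = sorted(
        range(len(db_codes)),
        key=lambda i: hamming_distance(query_code, db_codes[i]),
    )
    hits = 0
    precision_sum = 0.0
    for rank, index in enumerate(order, start=1):
        if db_labels[index] == query_label:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(query_codes, query_labels, db_codes, db_labels):
    """Mean of the per-query average precision values."""
    scores = [
        average_precision(code, label, db_codes, db_labels)
        for code, label in zip(query_codes, query_labels)
    ]
    return sum(scores) / len(scores)


# Toy 4-bit codes; labels are hypothetical.
db_codes = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 1, 0)]
db_labels = ["cat", "dog", "dog", "cat"]
query_codes = [(0, 0, 0, 0), (1, 1, 1, 1)]
query_labels = ["cat", "dog"]

score = mean_average_precision(query_codes, query_labels, db_codes, db_labels)
print(round(score, 4))
```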

The Effectiveness of the Self-paced Hard Generation Policy. In this section, we evaluate the impact of the generation policy on SSAH by comparing the self-paced loss with a fixed-paced loss. To define the fixed-paced loss, the dynamic margin in Eq. (5) is replaced by a fixed margin parameter.
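The difference between a fixed and a dynamic margin can be sketched in a few lines of Python. This is not a reproduction of Eq. (5): the hinge form `max(0, margin - distance)`, the linear schedule in `paced_margin`, and every name below are hypothetical stand-ins chosen only to show a margin that stays constant versus one that grows as training proceeds.

```python
def hinge(distance, margin):
    """Generic hinge penalty: positive while `distance` is below `margin`."""
    return max(0.0, margin - distance)


def fixed_margin(step, total_steps, m=0.1):
    """Fixed-paced variant: the same margin at every training step."""
    return m


def paced_margin(step, total_steps, m_max=0.1):
    """Hypothetical dynamic margin that grows linearly up to `m_max`."""
    return m_max * (step + 1) / total_steps


total_steps = 5
distance = 0.04  # hypothetical value, held constant for illustration
for step in range(total_steps):
    fixed = hinge(distance, fixed_margin(step, total_steps))
    paced = hinge(distance, paced_margin(step, total_steps))
    print(f"step {step}: fixed={fixed:.3f} paced={paced:.3f}")
```

In this toy schedule the paced penalty starts at zero and rises gradually, while the fixed penalty is the same from the first step.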

As shown in Table 6, compared with the fixed-paced loss, the self-paced adversarial loss is more effective and also more robust to the margin parameter. A possible reason is that the fixed-paced loss cannot adapt to the dynamic training procedure or balance hard pairs against simple ones.

Margin parameter    Self-paced loss       Fixed-paced loss
                    12 bits    48 bits    12 bits    48 bits
0.01                0.790      0.803      0.791      0.815
0.05                0.843      0.855      0.819      0.842
0.3                 0.855      0.873      0.838      0.852
0.5                 0.846      0.861      0.789      0.813
1.0                 0.849      0.852      0.775      0.801
Ours(0.1)           0.862      0.886      0.839      0.861
Table 6: The mAP scores of SSAH on CIFAR-10 using the self-paced loss and the fixed-paced loss with different margin parameters.
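The comparison in Table 6 can be summarized as a per-margin gap between the two losses. The Python sketch below computes self-paced minus fixed-paced mAP for each margin; `TABLE6` and `self_minus_fixed` are illustrative names, and the values are copied from the table as transcribed here.

```python
# margin -> (self-paced 12 bits, self-paced 48 bits,
#            fixed-paced 12 bits, fixed-paced 48 bits), from Table 6.
TABLE6 = {
    0.01: (0.790, 0.803, 0.791, 0.815),
    0.05: (0.843, 0.855, 0.819, 0.842),
    0.1: (0.862, 0.886, 0.839, 0.861),
    0.3: (0.855, 0.873, 0.838, 0.852),
    0.5: (0.846, 0.861, 0.789, 0.813),
    1.0: (0.849, 0.852, 0.775, 0.801),
}


def self_minus_fixed(table):
    """Per-margin mAP gap (self-paced minus fixed-paced) at 12 and 48 bits."""
    return {
        margin: (round(s12 - f12, 3), round(s48 - f48, 3))
        for margin, (s12, s48, f12, f48) in table.items()
    }


gaps = self_minus_fixed(TABLE6)
for margin in sorted(gaps):
    print(margin, gaps[margin])
```

On these numbers the two losses are close at a margin of 0.01, and the gap in favor of the self-paced loss is widest at margins of 0.5 and 1.0.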


To address the data insufficiency problem, we propose a Semi-supervised Self-paced Adversarial Hashing (SSAH) method, consisting of an adversarial network (A-Net) and a hashing network (H-Net). To exploit the semantic information in images and their corresponding generated hard images, we adopt a supervised semantic loss and a novel semi-supervised consistent loss to train the H-Net. The H-Net is then used to guide the training of the A-Net through a novel self-paced adversarial loss, so that the A-Net produces multi-scale masks and sets of deformation parameters. The A-Net and the H-Net are trained iteratively in an adversarial way. Extensive experimental results demonstrate the effectiveness of SSAH.

Acknowledgements This work was supported by the National Natural Science Foundation of China under Project No. 61772158, 61702136, and U1711265.


  • [1] A. Andoni and P. Indyk (2006) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS’06. 47th Annual IEEE Symposium on, Cited by: Introduction.
  • [2] P. Bachman, O. Alsharif, and D. Precup (2014) Learning with pseudo-ensembles. In Nips, pp. 3365–3373. Cited by: Loss Functions.
  • [3] Y. Cao, B. Liu, M. Long, J. Wang, and M. KLiss (2018) HashGAN: deep learning to hash with pair conditional wasserstein gan. In CVPR, pp. 1287–1296. Cited by: Introduction, Related Work, Comparative Methods and Evaluation Metrics.
  • [4] Y. Cao, M. Long, L. Bin, and J. Wang (2018) Deep cauchy hashing for hamming space retrieval. In CVPR, Cited by: Introduction.
  • [5] Y. Cao, M. Long, J. Wang, and S. Liu (2017) Deep visual-semantic quantization for efficient image retrieval. In CVPR, Cited by: Introduction.
  • [6] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng (2009) NUS-wide: a real-world web image database from national university of singapore. In MM, Cited by: Datasets.
  • [7] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin (2013) Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on PAMI 35 (12), pp. 2916–2929. Cited by: Comparative Methods and Evaluation Metrics.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Nips, pp. 2672–2680. Cited by: Introduction, Related Work.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: Implementation Details.
  • [10] H. Huang, D. Li, Z. Zhang, X. Chen, and K. Huang (2018) Adversarially occluded samples for person re-identification. In CVPR, Cited by: Related Work.
  • [11] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Nips, pp. 2017–2025. Cited by: Adversarial Network, Implementation Details.
  • [12] J. Johnson, A. Alahi, and L. Fei-Fei (2016)

    Perceptual losses for real-time style transfer and super-resolution

    In ECCV, pp. 694–711. Cited by: Implementation Details.
  • [13] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto. Cited by: Datasets.
  • [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012)

    Imagenet classification with deep convolutional neural networks

    In Nips, Cited by: Hashing Learning Network, Comparative Methods and Evaluation Metrics, Implementation Details.
  • [15] M. P. Kumar, B. Packer, and D. Koller (2010) Self-paced learning for latent variable models. In Nips, pp. 1189–1197. Cited by: footnote 2.
  • [16] S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242. Cited by: Loss Functions.
  • [17] M. Lin, R. Ji, H. Liu, X. Sun, Y. Wu, and Y. Wu (2019) Towards optimal discrete online hashing with balanced similarity. arXiv preprint arXiv:1901.10185. Cited by: Introduction.
  • [18] M. Lin, R. Ji, H. Liu, and Y. Wu (2018) Supervised online hashing via hadamard codebook learning. In MM, Cited by: Introduction.
  • [19] B. Liu, Y. Cao, M. Long, J. Wang, and J. Wang (2018) Deep triplet quantization. MM, ACM. Cited by: Introduction.
  • [20] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang (2012) Supervised hashing with kernels. In CVPR, Cited by: Comparative Methods and Evaluation Metrics.
  • [21] W. Liu, J. Wang, S. Kumar, and S. Chang (2011) Hashing with graphs. In

    Proceedings of the 28th international conference on machine learning

    pp. 1–8. Cited by: Related Work, Datasets.
  • [22] M. Nilsback and A. Zisserman (2006) A visual vocabulary for flower classification. In CVPR, Vol. 2, pp. 1447–1454. Cited by: Datasets.
  • [23] X. Peng, Z. Tang, F. Yang, R. S. Feris, and D. Metaxas (2018)

    Jointly optimize data augmentation and network training: adversarial data augmentation in human pose estimation

    In CVPR, pp. 2226–2234. Cited by: Related Work.
  • [24] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer (2018) Ganimation: anatomically-aware facial animation from a single image. In ECCV, pp. 818–833. Cited by: Implementation Details.
  • [25] Z. Qiu, Y. Pan, T. Yao, and T. Mei (2017) Deep semantic hashing with generative adversarial networks. In ACM SIGIR, pp. 225–234. Cited by: Introduction, Introduction, Related Work.
  • [26] A. Sablayrolles, M. Douze, N. Usunier, and H. Jégou (2017) How should we evaluate supervised hashing?. In ICASSP, Cited by: Quantitative Results.
  • [27] F. Shen, C. Shen, W. Liu, and H. Tao Shen (2015) Supervised discrete hashing. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    pp. 37–45. Cited by: Comparative Methods and Evaluation Metrics.
  • [28] J. Sheng, S. Xiaoshuai, Y. Hongxun, z. Shangchen, Z. Lei, and H. Xiansheng (2018) Deep saliency hashing for fine-grained retrieval. arXiv preprint arXiv:1807.01459. Cited by: Introduction, Comparative Methods and Evaluation Metrics.
  • [29] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Nips, Cited by: Loss Functions.
  • [30] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: Implementation Details.
  • [31] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, CIT. Cited by: Datasets.
  • [32] G. Wang, Q. Hu, J. Cheng, and Z. Hou (2018) Semi-supervised generative adversarial hashing for image retrieval. In ECCV, Cited by: Introduction, Introduction, Related Work, Datasets, Comparative Methods and Evaluation Metrics, Quantitative Results.
  • [33] J. Wang, S. Kumar, and S. Chang (2010) Semi-supervised hashing for scalable image retrieval. In CVPR, Cited by: Related Work.
  • [34] X. Wang, A. Shrivastava, and A. Gupta (2017) A-fast-rcnn: hard positive generation via adversary for object detection. In CVPR, Cited by: Related Work.
  • [35] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan (2014) Supervised hashing for image retrieval via image representation learning.. In AAAI, Vol. 1, pp. 2156–2162. Cited by: Related Work, Comparative Methods and Evaluation Metrics.
  • [36] X. Yan, L. Zhang, and W. Li (2017) Semi-supervised deep hashing with a bipartite graph.. In IJCAI, pp. 3238–3244. Cited by: Introduction, Related Work, Comparative Methods and Evaluation Metrics.
  • [37] J. Zhang and Y. Peng (2017) SSDH: semi-supervised deep hashing for large scale image retrieval. IEEE Transactions on Circuits and Systems for Video Technology 29 (1), pp. 212–225. Cited by: Introduction, Related Work, Comparative Methods and Evaluation Metrics, Quantitative Results.
  • [38] X. Zhang, H. Lai, and J. Feng (2018) Attention-aware deep adversarial hashing for cross-modal retrieval. In ECCV, Cited by: Related Work.
  • [39] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2017) Random erasing data augmentation. arXiv preprint arXiv:1708.04896. Cited by: Related Work.