Introduction
In the big data era, large-scale image retrieval is widely used in many practical applications, yet it remains challenging because of the large computational cost and high accuracy requirements. To address these efficiency and effectiveness issues, hashing methods have become a hot research topic. A great number of hashing methods have been proposed to map images into a Hamming space, including
traditional hashing methods [1, 18, 17] and deep hashing methods [5, 4, 19, 28]. Compared with traditional ones, deep hashing methods usually achieve better retrieval performance due to their powerful feature representation and nonlinear mapping abilities.
Although great efforts have been devoted to deep learning-based algorithms, their label-hungry nature makes them intractable in practice. Conversely, for many retrieval tasks, unlabeled data is plentiful. To make use of unlabeled data, several semi-supervised methods have been proposed, including
graph-based methods like SSDH [37] and BGDH [36], and generation-based methods like DSH-GANs [25], HashGAN [3], and SSGAH [32]. Graph-based works such as SSDH and BGDH use a graph structure to mine unlabeled data. However, constructing a graph model of large-scale data is computationally expensive and time-consuming, and using batch data instead may lead to suboptimal results. GANs [8] have proved effective in generation tasks, and this technique has recently been introduced into hashing. Existing GAN-based methods are restricted by two crucial problems: generation ineffectiveness and unlabeled-data underutilization. In terms of generation ineffectiveness, existing GAN-based methods train the generation network solely on label information. This setting leads to ineffective generations that are either too hard or too easy for hashing code learning and that cannot match the dynamic training of the hashing network. In terms of unlabeled-data underutilization, most existing works like DSH-GANs [25] only exploit unlabeled data to synthesize high-quality images, while the unlabeled data is not utilized when learning hashing codes. We argue that these two issues are not independent. In particular, the ineffective generation policy makes triplet-wise methods like SSGAH [32] fail to make the most of unlabeled data, since such algorithms heavily depend on hard triplets.
In this paper, we propose a novel deep hashing method, termed semi-supervised self-paced adversarial hashing (SSAH), as a solution to both generation ineffectiveness and unlabeled-data underutilization. The main idea of SSAH is depicted in Figure 1.
To tackle generation ineffectiveness, our method first generates proper hard images to gradually improve hashing learning. The generation network is designed to produce hard samples (we define samples that are difficult for the current retrieval model as hard samples) with multi-scale masks and multi-angle rotations. Second, we apply the key idea of SPL (self-paced learning [15], inspired by the human learning process, in which training proceeds from easy samples to gradually more complex ones) to our framework, aiming to control the hard generations during the dynamic training procedure. To tackle unlabeled-data underutilization, we propose a consistent loss that encourages consistent binary codes for each input image (labeled or unlabeled) and its corresponding hard generations.
Specifically, SSAH consists of an adversarial generation network (A-Net) and a hashing network (H-Net). The loss function contains four components: a self-paced adversarial loss, a semi-supervised consistent loss, a supervised semantic loss, and a quantization loss, which together guide the training of the two networks in an adversarial manner. The A-Net learns deformations and masks to generate hard images, whose quality is evaluated by the H-Net. The H-Net is then trained on both the input images and the generated hard samples. In the test phase, we only use the H-Net to produce hashing codes.
The main contributions of SSAH are three-fold:
-
To generate samples properly, we propose a novel hashing framework that integrates the self-paced adversarial mechanism into hard generation and hashing code learning. Our generation network takes both masking and deformation into account.
-
To make better use of unlabeled data, a novel consistent loss is proposed to exploit semantic information of all the data in a semi-supervised manner.
-
Experimental results on both general and fine-grained datasets demonstrate the superior performance of our method in comparison with many state-of-the-art methods.
Related Work
We introduce the most related works from two aspects: Semi-supervised Hashing and Hard Example Generation.

Semi-supervised Hashing. Based on whether labeled data is used in the training process, hashing methods can be divided into unsupervised [21], semi-supervised [36], and supervised ones [35]. Semi-supervised hashing is effective when a small amount of labeled data and plenty of unlabeled data are available. SSH [33] minimizes the empirical error over labeled data while maximizing the information entropy of binary codes over both labeled and unlabeled data. However, SSH is a traditional shallow method, which yields unsatisfactory performance compared with deep hashing methods such as SSDH [37] and BGDH [36], discussed in the introduction.
Very recently, some semi-supervised hashing methods have been proposed that use GANs [8] to augment data. Deep Semantic Hashing (DSH-GANs) [25] is the first hashing method to introduce GANs, but it can only incorporate pointwise labels, which are often unavailable in online image retrieval applications. Cao et al. propose HashGAN [3], a conditional GAN based on pairwise supervised information, to solve the insufficient-sample problem; however, its sample generation is independent of hashing code learning. Wang et al. [32] propose SSGAH, which utilizes triplet labels and a specifically designed GAN model that can be learned well with limited supervised information. For cross-modal hashing, Zhang et al. [38] mine attention regions in an adversarial way. However, all of these GAN-based methods try to generate as many images as possible, which is not an effective, or even feasible, solution in most cases.
Hard Example Generation.
Hard example generation is currently used in training deep models effectively for many computer vision tasks, including object detection
[34], retrieval [10], and other tasks [23]. Zhong et al. [39] propose a parameter-learning-free method, termed Random Erasing, which randomly selects a rectangular region in an image and erases its pixels with random values. However, the hard example generation of Random Erasing is isolated from network training, so the generated images may not be consistent with the dynamic training status. Very recently, Wang et al. [34] introduced the adversarial mechanism to synthesize hard samples; this method incorporates pointwise supervised information, e.g., class labels, which is often unavailable in online image retrieval applications. Different from previous methods, we propose a novel architecture that uses pairwise labels to generate hard samples for learning better hashing codes. Moreover, our generation networks learn hard samples in a self-paced adversarial manner.

Semi-supervised Self-paced Adversarial Hashing
We are given a dataset consisting of labeled and unlabeled data. Since each labeled image owns a unique class label, image pairs can be further labeled as similar ($s_{ij} = 1$) or dissimilar ($s_{ij} = 0$), where $s_{ij}$ denotes the pairwise label of images $(x_i, x_j)$. Our goal is to learn more accurate hashing codes in a semi-supervised way. To this end, we present Semi-supervised Self-paced Adversarial Hashing (SSAH) in detail. Since discrete optimization is difficult for deep networks, we first ignore the binary constraint and concentrate on binary-like codes $u$; we then obtain the binary codes $b = \operatorname{sgn}(u)$ from $u$. Figure 2 illustrates an overview of our architecture, which consists of an A-Net for hard sample generation and an H-Net for binary code learning. The generation network takes labeled and unlabeled images as inputs and produces hard masks and rotation-related parameters as outputs. The H-Net then learns compact binary hashing codes from both the generated hard samples and the training images.
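For illustration, the relaxation and the pairwise supervision can be sketched in a few lines of PyTorch; `binarize` and `pairwise_labels` are hypothetical helper names, assuming single-label images:

```python
import torch

def binarize(u: torch.Tensor) -> torch.Tensor:
    """Obtain binary codes b in {-1, +1}^k from binary-like codes u."""
    return torch.sign(u)

def pairwise_labels(y: torch.Tensor) -> torch.Tensor:
    """s_ij = 1 if images i and j share a class label, 0 otherwise."""
    return (y.unsqueeze(0) == y.unsqueeze(1)).float()
```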
Adversarial Network
The A-Net is designed to generate hard images in two main ways. The first is to change the rotation angle and scale of the whole image; for this we propose the Multi-Angle Rotation Network, termed MARN. The second is to generate masks that change pixel values; for this we propose the Multi-Scale Mask Network, termed MSMN.
Multi-Angle Rotation Network. Motivated by STN [11], we propose MARN to create multi-angle rotations of the training images. We transform the whole image rather than intermediate feature maps because we want to preserve the location correspondence between image pixels and landmark coordinates; disturbing the intermediate feature maps might hurt localization accuracy. However, a single-angle rotation network would tend to predict the largest allowed angle. To improve the diversity of generated images, MARN is designed to produce $n_r$ hard generations of the same size as the input images. Each generated hard sample is constrained to a different range of angles: the rotation degree of the first image is constrained within $\theta$ clockwise or anticlockwise, and the $t$-th generated image is constrained within $((t-1)\theta, t\theta]$ clockwise or anticlockwise. For each input image $x$, MARN thus produces $n_r$ hard generated images, denoted as $\{x^g_t\}_{t=1}^{n_r}$.
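A minimal PyTorch sketch of the band-constrained rotation follows; it samples the angle uniformly within the $t$-th band instead of predicting it with a learned localization network, and `theta_max` is an assumed band width:

```python
import math
import torch
import torch.nn.functional as F

def rotate_band(x: torch.Tensor, t: int, theta_max: float = 30.0) -> torch.Tensor:
    """Rotate a batch of images by angles from the t-th band
    ((t-1)*theta_max, t*theta_max] degrees, clockwise or anticlockwise.
    In MARN the angle is predicted by a localization net; here it is
    sampled randomly to illustrate the per-sample constraint."""
    n = x.size(0)
    lo, hi = (t - 1) * theta_max, t * theta_max
    sign = torch.randint(0, 2, (n,)).mul(2).sub(1)      # +-1 direction
    rad = torch.empty(n).uniform_(lo, hi) * sign * math.pi / 180.0
    cos, sin = torch.cos(rad), torch.sin(rad)
    theta = torch.zeros(n, 2, 3)                        # affine matrices
    theta[:, 0, 0], theta[:, 0, 1] = cos, -sin
    theta[:, 1, 0], theta[:, 1, 1] = sin, cos
    grid = F.affine_grid(theta, x.size(), align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)
```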
Multi-Scale Mask Network. The objective of MSMN is to produce multi-scale masks that make the training images harder. In the hashing learning stage, we can obtain convolutional features from different layers of the H-Net; these features represent regions of the original image at different spatial scales. Correspondingly, we generate multi-scale additive and multiplicative masks for the selected features.
The framework of MSMN is shown in Figure 3. Specifically, for each selected convolutional layer $l$, we extract features $F_l$ of size $d_l \times d_l \times c_l$, where $d_l$ is the spatial dimension and $c_l$ is the number of channels. Given these features, our MSMN predicts an additive hard mask $M^a_l$ and a multiplicative hard mask $M^m_l$ of the same size.
We use the sigmoid function as the activation for additive masks and the tanh function for multiplicative masks, so the values of $M^a_l$ lie in $(0, 1)$ and those of $M^m_l$ in $(-1, 1)$. The corresponding features of the hard samples, denoted as $\hat{F}_l$, are obtained by

$$\hat{F}_l = F_l \odot \left( \mathbf{1} + \lambda M^m_l \right) + \lambda M^a_l, \qquad (1)$$

where $\odot$ is element-wise multiplication and $\lambda$ scales the perturbation. When the value of $\lambda$ is 0, $\hat{F}_l$ reduces to the original features of the images $x$.
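The mask application of Eq (1) can be sketched as follows, assuming the combination form reconstructed above with `lam` standing for $\lambda$:

```python
import torch

def apply_masks(feat, add_mask_logits, mul_mask_logits, lam: float = 0.5):
    """Perturb H-Net features with MSMN masks as in Eq (1).
    The additive mask uses a sigmoid (range (0, 1)), the multiplicative
    mask a tanh (range (-1, 1)); lam = 0 recovers the original features."""
    m_add = torch.sigmoid(add_mask_logits)
    m_mul = torch.tanh(mul_mask_logits)
    return feat * (1.0 + lam * m_mul) + lam * m_add
```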
Hashing Learning Network
We directly adopt a pre-trained AlexNet [14] as the base model of the H-Net. The raw image pixels, from either the original images or the generated images, are the input of the hashing model. The learned additive and multiplicative masks are only applied in the coding process of the generated images $x^g$, whose hard feature maps are computed according to Eq (1). For each generated image $x^g$, its binary-like code $u^g$ is learned as:

$$u^g = \mathcal{H}\!\left(x^g;\ \{M^a_l, M^m_l\}\right), \qquad (2)$$

where $\mathcal{H}$ denotes the H-Net forward pass with the masks injected at the selected layers.
The output layer of AlexNet is replaced by a hashing layer whose dimension equals the length of the required hashing code.
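A sketch of such an H-Net in PyTorch follows; the `tanh` squashing of the binary-like codes is an assumption, and the injection of MSMN masks into intermediate features (Eq (2)) is omitted for brevity:

```python
import torch
import torch.nn as nn
from torchvision import models

class HNet(nn.Module):
    """AlexNet backbone whose final classifier layer is replaced by a
    hashing layer; tanh keeps the binary-like codes u in (-1, 1)."""
    def __init__(self, code_len: int = 48):
        super().__init__()
        base = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
        self.features = base.features
        self.avgpool = base.avgpool
        self.classifier = base.classifier[:-1]       # keep fc6 and fc7
        self.hash_layer = nn.Linear(4096, code_len)  # new hashing layer

    def forward(self, x):
        f = torch.flatten(self.avgpool(self.features(x)), 1)
        return torch.tanh(self.hash_layer(self.classifier(f)))
```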

Loss Functions
In this section, we design the adversarial loss function for the A-Net, including a self-paced adversarial loss and a supervised semantic loss, to guide the generation of hard samples. In addition, a hashing loss function for the H-Net, including the supervised semantic loss and the semi-supervised consistent loss, is proposed to learn better codes by capturing the semantic information of all data.
Self-paced Adversarial Loss. Most existing GAN-based hashing methods try to generate images to augment data. However, these methods cannot ensure the quality of the generated samples, which may be bad in two ways: (1) too difficult or too easy, and (2) unable to match the dynamic training.
To improve the effectiveness of generations, we leverage the H-Net to guide the training of the A-Net through a novel quantity termed the hard degree. Firstly, we compute the similarity degree of an image pair $(x_i, x_j)$ as $\frac{1}{2}\left(\cos(u_i, u_j) + 1\right)$, where $u$ represents the binary-like codes of the input images and $\cos(\cdot,\cdot)$ denotes cosine similarity. The distance between the pairwise label $s_{ij}$ and the estimated similarity degree is denoted as $D(x_i, x_j)$:

$$D(x_i, x_j) = \left( s_{ij} - \tfrac{1}{2}\left(\cos(u_i, u_j) + 1\right) \right)^2. \qquad (3)$$

$D(x^g_i, x^g_j)$ can be obtained similarly, where $u^g$ represents the binary-like codes of the corresponding generated images.
Since the hard samples learned by the A-Net should increase the H-Net loss, the value of $D(x^g_i, x^g_j)$ is required to be larger than that of $D(x_i, x_j)$. The hard degree of a generative image pair $(x^g_i, x^g_j)$ is defined as the difference between $D(x^g_i, x^g_j)$ and $D(x_i, x_j)$, which can be formulated as:

$$\mathcal{T}(x^g_i, x^g_j) = D(x^g_i, x^g_j) - D(x_i, x_j). \qquad (4)$$
We adopt a positive dynamic margin to define the self-paced adversarial loss, which can be written as:

$$\mathcal{L}_{sa}(x^g_i, x^g_j) = \max\!\left(0,\ \frac{\varepsilon}{D(x_i, x_j)} - \mathcal{T}(x^g_i, x^g_j)\right), \qquad (5)$$

where $\varepsilon$ is a fixed constant.
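Eqs (3)-(5) can be sketched as below, following the cosine-based similarity degree and the dynamic margin $\varepsilon / D(x_i, x_j)$ reconstructed above:

```python
import torch.nn.functional as F

def pair_distance(u_i, u_j, s_ij):
    """Eq (3): squared gap between the pairwise label s_ij and the
    similarity degree in [0, 1] derived from cosine similarity."""
    sim = 0.5 * (F.cosine_similarity(u_i, u_j, dim=-1) + 1.0)
    return (s_ij - sim) ** 2

def self_paced_adv_loss(u_i, u_j, ug_i, ug_j, s_ij, eps: float = 0.1):
    """Eqs (4)-(5): hard degree T = D(generated) - D(original), hinged
    against the dynamic margin eps / D(original)."""
    d_orig = pair_distance(u_i, u_j, s_ij)
    d_gen = pair_distance(ug_i, ug_j, s_ij)
    hard_degree = d_gen - d_orig                 # Eq (4)
    margin = eps / d_orig.clamp(min=1e-6)        # dynamic margin
    return F.relu(margin - hard_degree).mean()   # Eq (5)
```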
Discussion. The self-paced adversarial loss has two merits: it considers inter-pair differences and it generates hard samples gradually. In terms of inter-pair difference, when an original image pair is hard to distinguish, the value of $D(x_i, x_j)$ is large and the hard degree of $(x^g_i, x^g_j)$ need not be large. By contrast, if $(x_i, x_j)$ is easy to distinguish, the hard degree needs to be relatively large. To meet these different requirements, the dynamic margin $\varepsilon / D(x_i, x_j)$ adapts the required hard degree to each pair. In terms of the hard-generation policy, as the training of the H-Net proceeds, better codes are learned and $D(x_i, x_j)$ gradually becomes small, so the dynamic margin grows, leading to harder generations.
Similarly, we use $(x_i, x^g_j)$ as cross-domain image pairs, where $x_i$ is an input image and $x^g_j$ a corresponding hard sample. The hard degree of $(x_i, x^g_j)$ is defined as:

$$\mathcal{T}(x_i, x^g_j) = D(x_i, x^g_j) - D(x_i, x_j). \qquad (6)$$

The self-paced adversarial loss $\mathcal{L}_{sa}(x_i, x^g_j)$ of $(x_i, x^g_j)$ is then defined as in Eq (5), with the same constant $\varepsilon$. The whole self-paced adversarial loss over the three types of image pairs is written as:

$$\mathcal{L}_{adv} = \sum_{i,j} \left[ \mathcal{L}_{sa}(x^g_i, x^g_j) + \mathcal{L}_{sa}(x_i, x^g_j) + \mathcal{L}_{sa}(x^g_i, x_j) \right]. \qquad (7)$$
Supervised Semantic Loss. Similar to other hashing methods, we adopt a pairwise semantic loss to ensure that the hashing codes preserve the relevant class information. The similarity degree of an image pair is computed as $\frac{1}{2}(\cos(u_i, u_j) + 1)$ and is required to be near the pairwise label $s_{ij}$. Since the H-Net is trained on both labeled images and their corresponding hard generations, the supervised semantic loss can be written as:

$$\mathcal{L}_{sem} = \sum_{(x_i, x_j) \in \mathcal{X}_L} \left[ D(x_i, x_j) + D(x^g_i, x^g_j) \right]. \qquad (8)$$
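A sketch of one term of Eq (8); the full loss applies it to both the original pairs and their hard generations:

```python
import torch.nn.functional as F

def semantic_loss(u_i, u_j, s_ij):
    """One term of Eq (8): push the similarity degree of each code pair
    toward its pairwise label s_ij."""
    sim = 0.5 * (F.cosine_similarity(u_i, u_j, dim=-1) + 1.0)
    return ((s_ij - sim) ** 2).mean()
```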
Besides, when training the A-Net, the supervised semantic loss is also adopted as a regularization term to preserve the class information of the generated hard samples.
Semi-supervised Consistent Loss. In related computer vision tasks such as semi-supervised classification, pseudo-ensemble methods [2, 16, 29] have proved effective. (These methods are inspired by human cognition: when a percept is changed slightly, a human typically still considers it the same object.) They encourage consistent network outputs for each image with and without noise. Motivated by the success of these works, we propose a novel consistent loss to improve the utilization of unlabeled data.
However, in contrast to existing pseudo-ensemble methods, which adopt random, data-independent noise, our method uses the A-Net to generate noise better suited to the inputs, namely multi-scale masks and multi-angle rotations. The H-Net is then required to learn consistent binary codes for the training images $x$ (both labeled and unlabeled) and their corresponding hard samples $x^g$ by penalizing the distance between the codes $u$ and $u^g$. The consistent loss can be formulated as:

$$\mathcal{L}_{con} = \sum_{x_i \in \mathcal{X}} \left\| u_i - u^g_i \right\|_2^2. \qquad (9)$$
Quantization Loss. Since the binary constraint is relaxed, a widely used quantization loss is adopted to pull the values of $u_i$ and $b_i = \operatorname{sgn}(u_i)$ together, which is written as:

$$\mathcal{L}_{qua} = \sum_{x_i \in \mathcal{X}} \left\| b_i - u_i \right\|_2^2. \qquad (10)$$
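Both auxiliary losses are one-liners; the following sketch assumes binary-like codes of shape (n, k):

```python
import torch

def consistent_loss(u, u_g):
    """Eq (9): consistent codes for each image and its hard generation."""
    return ((u - u_g) ** 2).sum(dim=1).mean()

def quantization_loss(u):
    """Eq (10): pull relaxed codes u toward their binarization sgn(u)."""
    return ((u - torch.sign(u)) ** 2).sum(dim=1).mean()
```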
Alternating Optimization
Our network consists of two sub-networks: an adversarial network, termed A-Net, for hard image generation, and a hashing network, termed H-Net, for compact hashing code learning. As shown in Algorithm 1, we train the A-Net and the H-Net iteratively. The overall training objective of the A-Net integrates the semantic loss defined in Eq (8), the self-paced adversarial loss over the three types of image pairs defined in Eq (7), and the quantization loss defined in Eq (10). The A-Net is trained with the following loss:

$$\mathcal{L}_A = \mathcal{L}_{adv} + \lambda_1 \mathcal{L}_{sem} + \lambda_2 \mathcal{L}_{qua}. \qquad (11)$$
By minimizing this term, the A-Net is trained to generate proper hard samples, leading to better hashing codes.
For the shared hashing model, the H-Net is trained on the input data, including labeled and unlabeled images, together with their corresponding hard generative samples. The supervised semantic loss, the semi-supervised consistent loss, and the quantization loss are used to train the H-Net. We update its parameters according to the following overall loss:

$$\mathcal{L}_H = \mathcal{L}_{sem} + \lambda_3 \mathcal{L}_{con} + \lambda_4 \mathcal{L}_{qua}. \qquad (12)$$
By minimizing this term, the shared H-Net is trained to learn effective hashing codes.
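Putting the pieces together, one alternating round of Algorithm 1 might look as follows. This is a sketch only: it reuses the loss helpers above, pairs each batch item with its neighbour, and treats the whole batch as labeled for brevity:

```python
# Assumes self_paced_adv_loss, semantic_loss, consistent_loss, and
# quantization_loss from the earlier sketches; a_net/h_net/opt_a/opt_h
# are hypothetical module and optimizer names.
def train_step(a_net, h_net, opt_a, opt_h, x, s, lam=(1.0, 0.5, 0.5, 0.1)):
    lam1, lam2, lam3, lam4 = lam
    pair = lambda t: t.roll(1, dims=0)  # pair image i with image i+1

    # Step 1: update the A-Net with Eq (11); the input codes are detached
    # so only the generator receives gradients.
    x_g = a_net(x)
    u, u_g = h_net(x).detach(), h_net(x_g)
    loss_a = (self_paced_adv_loss(u, pair(u), u_g, pair(u_g), s)
              + lam1 * semantic_loss(u_g, pair(u_g), s)
              + lam2 * quantization_loss(u_g))
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()

    # Step 2: update the H-Net with Eq (12) on inputs and frozen generations.
    u, u_g = h_net(x), h_net(x_g.detach())
    loss_h = (semantic_loss(u, pair(u), s) + semantic_loss(u_g, pair(u_g), s)
              + lam3 * consistent_loss(u, u_g)
              + lam4 * quantization_loss(u))
    opt_h.zero_grad(); loss_h.backward(); opt_h.step()
    return loss_a.item(), loss_h.item()
```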
Experiment
To test the performance of our proposed SSAH method, we first conduct experiments on general hashing datasets, i.e., CIFAR-10 and NUS-WIDE, to verify its effectiveness. We then conduct experiments on two fine-grained datasets, i.e., CUB Bird and Stanford Dogs-120, to show that our method remains robust and effective for more complex fine-grained retrieval tasks. We also conduct analytical experiments to further verify our method.
Table 1: mAP of different methods on CIFAR-10 and NUS-WIDE for varying code lengths.

| Method | CIFAR-10, 12 bits | CIFAR-10, 24 bits | CIFAR-10, 48 bits | NUS-WIDE, 12 bits | NUS-WIDE, 24 bits | NUS-WIDE, 48 bits |
|---|---|---|---|---|---|---|
| ITQ-CCA | 0.435 | 0.435 | 0.435 | 0.526 | 0.575 | 0.594 |
| KSH | 0.556 | 0.572 | 0.588 | 0.618 | 0.651 | 0.682 |
| SDH | 0.558 | 0.596 | 0.614 | 0.645 | 0.688 | 0.711 |
| CNNH | 0.439 | 0.476 | 0.489 | 0.611 | 0.618 | 0.608 |
| HashGAN | 0.655 | 0.709 | 0.727 | 0.708 | 0.722 | 0.730 |
| SSDH | 0.801 | 0.813 | 0.814 | 0.773 | 0.779 | 0.778 |
| BGDH | 0.805 | 0.824 | 0.833 | 0.803 | 0.818 | 0.828 |
| SSGAH | 0.819 | 0.837 | 0.855 | 0.835 | 0.847 | 0.865 |
| SSAH | 0.862 | 0.878 | 0.886 | 0.872 | 0.884 | 0.898 |
Datasets
We conduct our experiments on two general datasets, namely CIFAR-10 and NUS-WIDE. CIFAR-10 [13] is a small image dataset including 60k 32×32 images in 10 classes; each image belongs to exactly one class (6,000 images per class). NUS-WIDE [6] contains nearly 270k images with 81 semantic concepts. For NUS-WIDE, we follow [21] and use the images associated with the 21 most frequent concepts, each of which is associated with at least 5,000 images. Following [21, 32], we randomly sample 100 images per class as the test set and use the rest as the database. In the training process, we randomly sample 500 images per class from the database as labeled data, and the rest serve as unlabeled data.

We further verify our method on two widely used fine-grained datasets, namely Stanford Dogs-120 and CUB Bird. Stanford Dogs-120 [22] consists of 20,580 images in 120 mutually exclusive classes, each containing about 150 images. CUB Bird [31] includes 11,788 images in 200 mutually exclusive classes. We directly use the test sets defined for these datasets, and the training set is used as the database. In the training process, we randomly sample 50% of the images per class from the database as labeled data, and the rest serve as unlabeled data.
Comparative Methods and Evaluation Metrics
For the general datasets, CIFAR-10 and NUS-WIDE, we compare our method (SSAH) with two supervised deep hashing methods, CNNH [35] and HashGAN [3]; three semi-supervised deep hashing methods, SSDH [37], BGDH [36], and SSGAH [32]; and three shallow methods, ITQ-CCA [7], KSH [20], and SDH [27]. For the fine-grained datasets, our method is further compared with DSaH [28], the first hashing method designed for fine-grained retrieval.

For a fair comparison between traditional and deep hashing methods, we run the traditional methods on features extracted from the fc7 layer of AlexNet pre-trained on ImageNet. For deep hashing methods, we use the original images as input and adopt AlexNet [14] as the backbone architecture.

Evaluation Metric. We use Mean Average Precision (mAP) for quantitative evaluation. Precision@top-N and Precision-Recall curves are provided in the supplementary material.
Implementation Details
Network Design. As shown in Figure 2, our model consists of an A-Net, comprising MSMN and MARN, and an H-Net. For MSMN, we adopt a lightweight version of the generator in GANimation [24]. This network contains a stride-2 convolution, three residual blocks [9], and a 1/2-strided convolution. Similar to Johnson et al. [12], we use instance normalization [30]. The configurations of MSMN are given in the supplementary material. MARN is built upon the STN [11]; unlike STN, our transformer network produces $n_r$ sets of affine parameters. We adopt AlexNet [14] as the encoder of the H-Net; all layers but the last are copied from the pre-trained AlexNet, and all layers are fine-tuned.

Training Details. Our SSAH is implemented in PyTorch and the deep model is trained by batch gradient descent. In the training stage, images are fed in batches and every two images in the same batch construct an image pair. In practice, we train the A-Net before the H-Net: if we trained the H-Net first, the A-Net might output semantically irrelevant hard generations, which would be bad samples and guide the training of the hashing model in the wrong direction.
Network Parameters. The hyper-parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ in Eq (11) and Eq (12) are set to 1.0, 0.5, 0.5, and 0.1, respectively. We use mini-batch stochastic gradient descent with 0.9 momentum. We set the margin constant $\varepsilon$ to 0.1 and increase it every 5 epochs. The mini-batch size is fixed to 32 and the weight decay parameter to 0.0005. The number of rotated hard samples $n_r$ is 3.
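Under these settings, the two optimizers can be configured as below; the learning rate is not reported in the paper and is an assumed placeholder, as are the module names `a_net` and `h_net`:

```python
import torch

# SGD with the reported momentum and weight decay for both sub-networks.
opt_h = torch.optim.SGD(h_net.parameters(), lr=1e-3,  # lr assumed
                        momentum=0.9, weight_decay=5e-4)
opt_a = torch.optim.SGD(a_net.parameters(), lr=1e-3,
                        momentum=0.9, weight_decay=5e-4)
```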
Quantitative Results
Performance on general hashing datasets. The Mean Average Precision (mAP, %) results of different methods with different numbers of bits on CIFAR-10 and NUS-WIDE are shown in Table 1. SSAH outperforms the state-of-the-art SSGAH [32] by 3.1%-4.3% on CIFAR-10 and 3.3%-3.7% on NUS-WIDE (absolute mAP). These results show that SSAH is highly effective for the traditional hashing task.
Performance on fine-grained hashing datasets. Fine-grained retrieval requires describing objects that share a similar overall appearance but differ in subtle details. Meeting this requirement places greater demands on collecting and annotating data, which in some cases requires professional knowledge. Since fine-grained objects are harder to generate, GAN-based hashing methods are also less effective here due to the scarcity of supervised data. The experimental results show that our method remains robust on this task.
The mAP results of different methods on the fine-grained datasets are shown in Table 2. The proposed SSAH method substantially outperforms all comparison methods. Compared with the previous best retrieval performance (DSaH with AlexNet), SSAH achieves absolute increases of 5.0%-7.4% on the CUB Bird dataset and 2.9%-7.0% on the Stanford Dogs dataset. Relative to the traditional hashing task, our method thus achieves a particularly significant improvement in fine-grained retrieval. Moreover, we only use the H-Net to produce binary codes in the test phase, whereas DSaH needs to highlight salient regions before encoding, so our method is also more time-efficient.
Table 2: mAP of different methods on the fine-grained datasets.

| Method | Stanford Dogs-120, 12 bits | Stanford Dogs-120, 24 bits | Stanford Dogs-120, 48 bits | CUB Bird, 12 bits | CUB Bird, 24 bits | CUB Bird, 48 bits |
|---|---|---|---|---|---|---|
| HashGAN | 0.029 | 0.172 | 0.283 | 0.020 | 0.0542 | 0.123 |
| SSGAH | 0.127 | 0.233 | 0.329 | 0.073 | 0.1321 | 0.247 |
| DSaH | 0.244 | 0.287 | 0.408 | 0.091 | 0.2087 | 0.285 |
| SSAH | 0.273 | 0.343 | 0.478 | 0.141 | 0.265 | 0.359 |
Table 3: mAP under the unseen-classes protocol on CIFAR-10 and NUS-WIDE.

| Method | CIFAR-10, 12 bits | CIFAR-10, 24 bits | CIFAR-10, 48 bits | NUS-WIDE, 12 bits | NUS-WIDE, 24 bits | NUS-WIDE, 48 bits |
|---|---|---|---|---|---|---|
| ITQ-CCA | 0.157 | 0.165 | 0.201 | 0.488 | 0.493 | 0.503 |
| SDH | 0.185 | 0.193 | 0.213 | 0.471 | 0.490 | 0.507 |
| CNNH | 0.210 | 0.225 | 0.231 | 0.445 | 0.463 | 0.477 |
| DRSCH | 0.219 | 0.223 | 0.251 | 0.457 | 0.464 | 0.360 |
| NINH | 0.241 | 0.249 | 0.272 | 0.484 | 0.483 | 0.487 |
| SSDH | 0.285 | 0.291 | 0.325 | 0.510 | 0.533 | 0.551 |
| SSGAH | 0.309 | 0.323 | 0.339 | 0.539 | 0.553 | 0.579 |
| SSAH | 0.338 | 0.370 | 0.379 | 0.569 | 0.571 | 0.596 |
Performance on Unseen Classes. To further evaluate our SSAH approach, we adopt the evaluation protocol from [26]. In the training process, 75% of the classes (termed set75) are known, and the remaining 25% (termed set25) are used for evaluation. Both set75 and set25 are further divided into training and test sets of equal size. Following the settings in [37], we use train75 as the training set and test25 as the test set. For general hashing retrieval, set75 consists of 7 classes on CIFAR-10 and 15 classes on NUS-WIDE; results are averaged over 5 different random splits, and mAP scores are calculated over all returned images.
The mAP scores for retrieval of unseen classes are shown in Table 3. Our SSAH method achieves the best results, indicating better generalization to unseen classes. The corresponding results on the fine-grained datasets are given in the supplementary material.
Ablation Study
To further verify our method, we conduct several analysis experiments, covering: (1) the effectiveness of hard sample generation, (2) the contribution of each loss component, and (3) the effectiveness of the self-paced generation policy.
Component Analysis of the Network. Using the AlexNet architecture, we compare our MARN and MSMN with random image rotation and random mask generation strategies in the training stage (a sketch of the random-mask baseline is given below). (1) Random Image Rotation: for each image, we obtain three rotated images, each rotated within a specified angle range. (2) Random Mask Generation: random multiplicative masks take values in (-1, 1) and random additive masks in (0, 1), i.e., the same ranges as the masks produced by MSMN.
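A sketch of the random-mask baseline, drawing data-independent masks from the stated ranges and applying them as in Eq (1):

```python
import torch

def random_masks(feat, lam: float = 0.5):
    """Ablation baseline: data-independent masks drawn uniformly from the
    same ranges MSMN produces ((0, 1) additive, (-1, 1) multiplicative)."""
    m_add = torch.rand_like(feat)               # uniform in (0, 1)
    m_mul = torch.rand_like(feat) * 2.0 - 1.0   # uniform in (-1, 1)
    return feat * (1.0 + lam * m_mul) + lam * m_add
```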
We report the results of MARN and MSMN in Table 4. With the AlexNet architecture, the mAP of our implemented baseline is 0.751/0.792 (12/48 bits) on CIFAR-10 and 0.762/0.801 on NUS-WIDE. On this basis, joint learning with our MARN model improves the baseline by 3.9%-5.0% and 4.1%-4.8% on the two datasets, and joint learning with the MSMN model improves it by 6.1%-7.9% and 6.0%-7.8%. As the two modules are complementary, combining MARN and MSMN into our model gives a further boost to 0.862/0.886 on CIFAR-10 and 0.872/0.898 on NUS-WIDE.
Table 4: Ablation of MARN and MSMN (mAP) on CIFAR-10 and NUS-WIDE.

| Method | CIFAR-10, 12 bits | CIFAR-10, 48 bits | NUS-WIDE, 12 bits | NUS-WIDE, 48 bits |
|---|---|---|---|---|
| baseline | 0.751 | 0.792 | 0.762 | 0.801 |
| random rotate | 0.788 | 0.810 | 0.797 | 0.828 |
| +MARN | 0.801 | 0.831 | 0.810 | 0.842 |
| random mask | 0.792 | 0.813 | 0.806 | 0.811 |
| +MSMN () | 0.811 | 0.838 | 0.819 | 0.832 |
| +MSMN () | 0.820 | 0.847 | 0.831 | 0.849 |
| +MSMN | 0.830 | 0.853 | 0.840 | 0.861 |
| random rotate+mask | 0.796 | 0.816 | 0.810 | 0.833 |
| Ours (full) | 0.862 | 0.886 | 0.872 | 0.898 |
Component Analysis of the Loss Functions. Our loss function consists of three major components: $\mathcal{L}_{sem}$, $\mathcal{L}_{con}$, and $\mathcal{L}_{adv}$, where $\mathcal{L}_{adv}$ includes the terms for hard and cross-domain image pairs, denoted $\mathcal{L}_{sa}(x^g_i, x^g_j)$ and $\mathcal{L}_{sa}(x_i, x^g_j)$. To evaluate the contribution of each loss, we study the effect of different loss combinations on retrieval performance. As Table 5 shows, retrieval performance is best when we use $\mathcal{L}_{sem}$, $\mathcal{L}_{con}$, and $\mathcal{L}_{qua}$ to train the H-Net, and $\mathcal{L}_{sem}$ and $\mathcal{L}_{adv}$ to train the A-Net. For the H-Net, the self-paced adversarial loss may disrupt the training of accurate binary codes. Combined with $\mathcal{L}_{con}$, our method further improves, which shows that the H-Net captures the semantic information of unlabeled data. We also show typical visualizations of hard masks produced with different loss components of the A-Net in Figure 4.
As shown in Figure 4, if we only use the semantic loss for the A-Net, the generated mask avoids the object; if we only use the self-paced adversarial loss, the generated mask fully occludes the object in some cases. Using the combination of $\mathcal{L}_{sem}$ and $\mathcal{L}_{adv}$ obtains a proper mask: the object is partially occluded but still recognizable.

Table 5: mAP on CIFAR-10 for different combinations of losses used to train the A-Net and the H-Net.

| A-Net losses | H-Net losses | 12 bits | 24 bits | 48 bits |
|---|---|---|---|---|
| | | 0.809 | 0.823 | 0.850 |
| | | 0.821 | 0.843 | 0.862 |
| | | 0.836 | 0.850 | 0.870 |
| | | 0.848 | 0.862 | 0.872 |
| | | 0.841 | 0.855 | 0.867 |
| | | 0.862 | 0.878 | 0.886 |
| | | 0.843 | 0.852 | 0.861 |
The Effectiveness of the Self-paced Hard Generation Policy. In this section, we evaluate the impact of the generation policy in SSAH by comparing the self-paced loss with a fixed-paced loss. To define the fixed-paced loss, the dynamic margin in Eq (5) is replaced by a fixed parameter $\varepsilon'$, giving $\mathcal{L}_{fa}(x^g_i, x^g_j) = \max\!\left(0,\ \varepsilon' - \mathcal{T}(x^g_i, x^g_j)\right)$.
As shown in Table 6, compared with the fixed-paced loss, the self-paced adversarial loss is more effective and more robust to the margin parameter. A possible reason is that the fixed-paced loss cannot match the dynamic training procedure or balance hard pairs against simple ones.
Table 6: mAP on CIFAR-10 for the self-paced vs. fixed-paced losses under different margin parameters.

| Margin parameter | Self-paced, 12 bits | Self-paced, 48 bits | Fixed-paced, 12 bits | Fixed-paced, 48 bits |
|---|---|---|---|---|
| 0.01 | 0.790 | 0.803 | 0.791 | 0.815 |
| 0.05 | 0.843 | 0.855 | 0.819 | 0.842 |
| 0.3 | 0.855 | 0.873 | 0.838 | 0.852 |
| 0.5 | 0.846 | 0.861 | 0.789 | 0.813 |
| 1.0 | 0.849 | 0.852 | 0.775 | 0.801 |
| Ours (0.1) | 0.862 | 0.886 | 0.839 | 0.861 |
Conclusion
To alleviate data insufficiency, we propose a Semi-supervised Self-paced Adversarial Hashing (SSAH) method consisting of an adversarial network (A-Net) and a hashing network (H-Net). To exploit the semantic information in images and their corresponding hard generations, we adopt a supervised semantic loss and a novel semi-supervised consistent loss to train the H-Net. The H-Net in turn guides the training of the A-Net through a novel self-paced adversarial loss to produce multi-scale masks and sets of deformation parameters. The A-Net and the H-Net are trained iteratively in an adversarial way. Extensive experimental results demonstrate the effectiveness of SSAH.
Acknowledgements This work was supported by the National Natural Science Foundation of China under Project No. 61772158, 61702136, and U1711265.
References
- [1] Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, 2006.
- [2] Learning with pseudo-ensembles. In NIPS, pp. 3365-3373, 2014.
- [3] HashGAN: deep learning to hash with pair conditional Wasserstein GAN. In CVPR, pp. 1287-1296, 2018.
- [4] Deep Cauchy hashing for Hamming space retrieval. In CVPR, 2018.
- [5] Deep visual-semantic quantization for efficient image retrieval. In CVPR, 2017.
- [6] NUS-WIDE: a real-world web image database from National University of Singapore. In MM, 2009.
- [7] Iterative quantization: a Procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on PAMI, 35(12):2916-2929, 2013.
- [8] Generative adversarial nets. In NIPS, pp. 2672-2680, 2014.
- [9] Deep residual learning for image recognition. In CVPR, pp. 770-778, 2016.
- [10] Adversarially occluded samples for person re-identification. In CVPR, 2018.
- [11] Spatial transformer networks. In NIPS, pp. 2017-2025, 2015.
- [12] Perceptual losses for real-time style transfer and super-resolution. In ECCV, pp. 694-711, 2016.
- [13] Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
- [14] ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- [15] Self-paced learning for latent variable models. In NIPS, pp. 1189-1197, 2010.
- [16] Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
- [17] Towards optimal discrete online hashing with balanced similarity. arXiv preprint arXiv:1901.10185, 2019.
- [18] Supervised online hashing via Hadamard codebook learning. In MM, 2018.
- [19] Deep triplet quantization. In MM, 2018.
- [20] Supervised hashing with kernels. In CVPR, 2012.
- [21] Hashing with graphs. In ICML, pp. 1-8, 2011.
- [22] A visual vocabulary for flower classification. In CVPR, Vol. 2, pp. 1447-1454, 2006.
- [23] Jointly optimize data augmentation and network training: adversarial data augmentation in human pose estimation. In CVPR, pp. 2226-2234, 2018.
- [24] GANimation: anatomically-aware facial animation from a single image. In ECCV, pp. 818-833, 2018.
- [25] Deep semantic hashing with generative adversarial networks. In SIGIR, pp. 225-234, 2017.
- [26] How should we evaluate supervised hashing? In ICASSP, 2017.
- [27] Supervised discrete hashing. In CVPR, pp. 37-45, 2015.
- [28] Deep saliency hashing for fine-grained retrieval. arXiv preprint arXiv:1807.01459, 2018.
- [29] Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, 2017.
- [30] Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
- [31] The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, Caltech, 2011.
- [32] Semi-supervised generative adversarial hashing for image retrieval. In ECCV, 2018.
- [33] Semi-supervised hashing for scalable image retrieval. In CVPR, 2010.
- [34] A-Fast-RCNN: hard positive generation via adversary for object detection. In CVPR, 2017.
- [35] Supervised hashing for image retrieval via image representation learning. In AAAI, pp. 2156-2162, 2014.
- [36] Semi-supervised deep hashing with a bipartite graph. In IJCAI, pp. 3238-3244, 2017.
- [37] SSDH: semi-supervised deep hashing for large scale image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 29(1):212-225, 2017.
- [38] Attention-aware deep adversarial hashing for cross-modal retrieval. In ECCV, 2018.
- [39] Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.