Self-supervised asymmetric deep hashing with margin-scalable constraint for image retrieval

12/07/2020 ∙ by Zhengyang Yu, et al. ∙ Southwest University

Owing to its effectiveness and speed, image retrieval based on deep hashing has attracted wide attention, especially in large-scale visual search. However, many existing deep hashing methods use label information only as direct guidance for the feature learning network, without further exploring the semantic space. In addition, the similarity correlations in Hamming space are not fully discovered and embedded into the hash codes, so retrieval quality suffers from inefficient preservation of pairwise correlations and multi-label semantics. To cope with these problems, we propose a novel self-supervised asymmetric deep hashing method with a margin-scalable constraint (SADH) for image retrieval. SADH uses a self-supervised network to preserve rich semantic information in a semantic feature map and a semantic code map for each semantics of the given dataset, which efficiently and precisely guides a feature learning network to preserve multi-label semantic information through an asymmetric learning strategy. Moreover, in the feature learning part, the semantic maps are further exploited to define a new margin-scalable constraint, which yields both a highly accurate construction of pairwise correlations in Hamming space and a more discriminative hash code representation. Extensive experiments on three benchmark datasets validate that the proposed method outperforms several state-of-the-art approaches.

1 Introduction

The image and video data in social networks and search engines are growing at an alarming rate. To search large-scale, high-dimensional image data effectively, researchers proposed Approximate Nearest Neighbor (ANN) search [1, 2]. As an ANN algorithm, hashing is widely used in large-scale image retrieval. It maps high-dimensional content features of images into Hamming space (binary space) to generate a low-dimensional hash sequence [1, 2].

Hash algorithms can be broadly divided into data-dependent and data-independent methods [31]. The most basic but representative data-independent method is Locality Sensitive Hashing (LSH) [1], which generates embeddings through random projections. However, such methods require long binary codes to achieve good accuracy, which makes them ill-suited to processing large-scale visual data. Recent research priorities have therefore shifted to data-dependent approaches, which generate compact binary codes by learning from large amounts of data. This type of method embeds high-dimensional data into Hamming space and performs bitwise operations to find similar objects. Recent data-dependent works such as ITQ, KMH, KSH, BRE, SH, MLH and Hamming Distance Metric Learning [2, 5, 6, 7, 8, 59, 60] have shown better retrieval accuracy at shorter hash code lengths.
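As an illustration of the data-independent case, the following minimal sketch hashes vectors with random projections and compares the resulting codes with a bitwise XOR. The function names (`make_lsh`, `lsh_code`, `hamming`) and the toy vectors are assumptions made for this example and do not come from any of the cited implementations.

```python
import random

def make_lsh(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes; no training data is used."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_code(x, planes):
    """Pack the signs of the random projections of x into one integer."""
    code = 0
    for w in planes:
        bit = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
        code = (code << 1) | bit
    return code

def hamming(a, b):
    """Hamming distance of two packed codes: XOR, then count set bits."""
    return bin(a ^ b).count("1")

planes = make_lsh(dim=8, n_bits=16)
x = [1.0, 0.5, -0.2, 0.3, 0.9, -1.1, 0.0, 0.4]
near = [xi + 0.01 for xi in x]   # slightly perturbed copy of x
far = [-xi for xi in x]          # points in the opposite direction

print("distance to near neighbour:", hamming(lsh_code(x, planes), lsh_code(near, planes)))
print("distance to opposite point:", hamming(lsh_code(x, planes), lsh_code(far, planes)))
```

Because the hyperplanes are random rather than learned, many bits are typically needed before the code distances track the original similarities well, which is the limitation that motivates data-dependent hashing.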

Although the above data-dependent hashing methods have succeeded to some extent, they all use hand-crafted features that do not capture the semantic information under the dramatic variations of real data, which limits the retrieval accuracy of the learned binary codes. Recently, deep-learning-based hashing methods have shown superior performance by exploiting the powerful feature extraction of deep learning [9, 25, 11, 12, 13, 14]. Nevertheless, most of these approaches simply employ supervised information and pairwise correlation constraints as guidance during deep feature learning and hash code generation, which leads to two problems. Firstly, such a supervising mode may benefit the extraction of deep features, but the quality of the generated hash codes is relatively neglected. Secondly, in many datasets such as NUS-WIDE [15] and MIRFlickr-25K [16], an image is annotated with multi-label semantics, and item pairs that share more semantics should generate hash codes that are closer in Hamming space than those with less semantic similarity. It is therefore crucial to build a more advanced correlation mechanism that stores richer similarity information between image pairs and generates discriminative hash codes with rich multi-label semantic information.

To tackle these flaws, we propose a novel self-supervised asymmetric deep hashing method with a margin-scalable constraint (SADH) to improve the accuracy and efficiency of image retrieval. In contrast to the feature learning strategy with direct semantic supervision and similarity constraints used in many deep hashing methods, we propose a self-supervised network that generates a class-based semantic code map and a semantic feature map with rich semantic information, which supervise the feature learning network in an asymmetric fashion. Meanwhile, a new margin-scalable constraint is proposed based on the semantic maps generated by the self-supervised network. It builds highly precise pairwise correlations, which in turn effectively cluster the high-dimensional features and hash codes of similar item pairs and simultaneously separate the dissimilar ones. The main contributions of this paper can be summarized as follows:

  • To preserve more semantic information in the feature learning network, an asymmetric hashing method is proposed in which a self-supervised network generates a hash code map and a semantic feature map. To the best of our knowledge, it is the first work to utilize an asymmetric learning strategy for the preservation of semantic information in hash function learning for large-scale visual search.

  • Based on the semantic maps, a margin-scalable constraint is proposed in pursuit of a discriminative hash code representation in the feature learning network, exploring highly accurate pairwise correlations and abundant multi-label semantics.

  • In experiments on CIFAR-10, NUS-WIDE and MIRFlickr-25K, the proposed method outperforms several state-of-the-art image retrieval hashing methods.

2 Related work

2.1 Non-deep hashing methods

Unsupervised hashing methods [17, 2, 32, 45, 50, 51, 52] attempt to map the original features into Hamming space while using unlabeled data to preserve the similarity relationships between the original features. Examples include Isotropic Hashing [19], Spectral Hashing (SH) [8] and PCA-Iterative Quantization (ITQ) [2]. However, unsupervised hashing methods may lose the rich semantic information in image tags. To handle more complex semantic similarities, supervised approaches have been proposed to exploit tag information. Supervised Hashing with Kernels (KSH) [6] and Supervised Discrete Hashing (SDH) [19] generate binary hash codes by minimizing the Hamming distance between similar data point pairs. Distortion Minimization Hashing (DMS) [59], Minimal Loss Hashing (MLH) [21] and Order Preserving Hashing (OPH) [22] learn hash codes by minimizing a triplet loss built on similar pairs. Although these hashing methods have succeeded to some extent, they all use hand-crafted features that do not fully capture the semantic information under the dramatic variations of real data, which limits the retrieval accuracy of the learned binary codes.

2.2 Deep hashing methods

Recently, deep learning-based hashing methods have shown superior performance by exploiting the powerful feature extraction of deep learning [21, 23, 24, 53, 54, 55, 56, 57, 58]. In particular, Convolutional Neural Network Hashing (CNNH) [42] is a two-stage hashing method that learns hash codes and deep hash functions for image retrieval in separate stages. DNNH [25] improved on CNNH [42] by learning features and hash codes simultaneously; similarly, DPSH [9] performs joint hash code learning and feature learning with pairwise labels. HashNet [12] equips the deep network with sign as the activation function to directly optimize the hash function. DDSH [27] utilizes semantic labels to directly supervise both hash code learning and deep feature learning. DSDH [28] uses both pairwise label information and classification supervision to learn hash codes in a single framework. Although these methods obtain satisfactory retrieval performance, they neglect to construct more precise pairwise correlations between pairs of hash codes and deep features, which may degrade retrieval accuracy. Besides, they mostly harness supervised information to directly guide the feature learning procedure of deep networks, which may incur a substantial loss of semantic information.

To address this limitation, some methods have been proposed to further enrich the semantic information of hash codes beyond direct semantic supervision. DSEH and ADSQ [29, 30] utilize self-supervised networks to capture rich pairwise semantic information and guide the feature learning network at both the semantic level and the hash code level, thus enriching the semantic information and pairwise correlation in hash codes. In comparison to [29, 30], our work uses an asymmetric learning strategy to make the guidance of the self-supervised network on the feature learning network less time-consuming and more semantic-targeted. Moreover, the generated semantic codes are further exploited to identify scalable margins for a pairwise contrastive-form constraint, yielding higher-level pairwise correlations and highly discriminative hash code representations.

2.3 Asymmetric hashing methods

Asymmetric hashing methods [30, 31, 32, 33] have recently become an eye-catching research focus. In ADSH [32], query points and database points are treated asymmetrically: only query points are engaged in updating the deep network parameters, while the hash codes for the database are learned directly as independent parameters. The hash codes generated for queries and database are correlated through asymmetric pairwise constraints, so that the database points can be utilized efficiently during hash function learning. AGAH [34] adopts an asymmetric discrete loss to ensure multi-label semantic preservation in cross-modal hashing retrieval. To the best of our knowledge, our work is the first attempt to exploit an asymmetric learning strategy in the guidance of a self-supervised network on a feature learning network, in pursuit of comprehensive and efficient semantic preservation.

Figure 1: The overall framework of ImgNet and LabNet in our proposed SADH. ImgNet consists of deep convolutional layers (CNNs) for deep image representations, while LabNet is an end-to-end fully-connected deep neural network that abstracts semantic features with one-hot annotations as inputs. Both networks embed deep features into the semantic space through a semantic layer, and independently obtain classification outputs and binary codes under a multi-task learning framework. The asymmetric guidance mechanism between the networks is illustrated in Fig. 2.

Figure 2: The overview of the proposed asymmetric guidance between LabNet and ImgNet in our SADH. A group of hash codes and their related deep features, one for each semantics in the given dataset (the semantic maps), is obtained by the converged LabNet. These maps asymmetrically guide ImgNet, which is fed with the images of the entire training set. The semantic maps are further exploited to define scalable margins in the pairwise constraint of ImgNet.

3 The proposed method

We now describe our proposed SADH in detail. First, the problem formulation for hash function learning is presented. Afterwards, each module and the optimization strategy of the self-supervised network (namely LabNet) and of the feature learning network (namely ImgNet) are described. As can be seen in the overall framework in Fig. 1, SADH consists of two networks. LabNet is a deep fully-connected network for semantic preservation with one-hot labels as inputs, while ImgNet is a convolutional neural network that maps input images into binary hash codes, with both its deep features (generated by the semantic layer) and its hash codes (generated by the hash layer) under the asymmetric guidance of LabNet, as shown in Fig. 2.

3.1 Problem definition

We first introduce the notation. In an image retrieval problem, let $O = \{o_i\}_{i=1}^{m}$ denote a dataset with $m$ instances, with $o_i = (v_i, l_i)$, where $v_i$ is the original image feature of the $i$-th sample. Assuming that there are $C$ classes in this dataset, $o_i$ is annotated with the multi-label semantic vector $l_i = [l_{i1}, \dots, l_{iC}]$, where $l_{ij} = 1$ indicates that $o_i$ belongs to the $j$-th class and $l_{ij} = 0$ otherwise. The image-feature matrix is denoted $V$ and the label matrix $L$ for all instances. The pairwise multi-label similarity matrix $S$ describes the semantic similarity between each pair of instances, where $S_{ij} = 1$ means that $o_i$ is semantically similar to $o_j$, and $S_{ij} = 0$ otherwise. In a multi-label setting, two instances $o_i$ and $o_j$ are annotated by multiple labels, so we define $S_{ij} = 1$ if $o_i$ and $o_j$ share at least one label, and $S_{ij} = 0$ otherwise. The main goal of deep hashing retrieval is to identify a nonlinear hash function $H: o \mapsto h \in \{-1, 1\}^{K}$, where $K$ is the length of each hash code, that encodes each item $o_i$ into a $K$-bit hash code $h_i$ in which the correlations of item pairs are preserved. The similarity between a pair of hash codes is evaluated by their Hamming distance $\mathrm{dist}_H(h_i, h_j)$, which can be costly to compute [61]. The inner product $\langle h_i, h_j \rangle$ can serve as a surrogate, since it is related to the Hamming distance by:

$\mathrm{dist}_H(h_i, h_j) = \frac{1}{2}\left(K - \langle h_i, h_j \rangle\right)$    (1)
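The relationship between Hamming distance and inner product for codes with entries in {-1, 1} can be checked numerically. The short sketch below is only an illustration: the code length `K = 32` and the random codes are arbitrary choices for the example. Each agreeing bit contributes +1 to the inner product and each differing bit contributes -1, so the inner product equals the code length minus twice the Hamming distance.

```python
import random

def hamming(h1, h2):
    """Number of positions at which two {-1, +1} codes differ."""
    return sum(1 for a, b in zip(h1, h2) if a != b)

def inner(h1, h2):
    """Inner product of two {-1, +1} codes."""
    return sum(a * b for a, b in zip(h1, h2))

K = 32                      # code length, chosen arbitrarily for this check
rng = random.Random(0)
h_i = [rng.choice((-1, 1)) for _ in range(K)]
h_j = [rng.choice((-1, 1)) for _ in range(K)]

# K - <h_i, h_j> is always even, so the integer division is exact.
print(hamming(h_i, h_j), (K - inner(h_i, h_j)) // 2)
```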

3.2 Self-supervised network

To enrich the semantic information in the generated hash codes, we design a fully-connected LabNet that leverages the abundant semantic correlations in multi-label annotations to guide the subsequent feature learning process of ImgNet.

LabNet extracts high-dimensional semantic features through fully-connected layers with multi-label annotations as inputs, i.e., $H^l = f^l(l; \theta^l)$, where $f^l$ is the nonlinear hash function of LabNet and $\theta^l$ denotes the parameters of LabNet. With a sign function, the learned $H^l$ can be discretized into binary codes:

$B^l = \mathrm{sign}(H^l)$    (2)

For more advanced preservation of semantic information especially in multi-label scenarios, the high-dimensional semantic features of LabNet are also exploited to supervise the semantic learning of ImgNet.

3.2.1 Cosine-distance-based similarity evaluation

In Hamming space, the similarity of two hash codes can be measured by their Hamming distance. To preserve the similarity of item pairs, with similar pairs clustered and dissimilar pairs scattered, the similarity loss function of LabNet can be defined as follows:

(3)

Here the similarity loss preserves the similarity of two generated hash codes, and their distance is measured by the Hamming distance. To avoid the collapsed scenario [47], a contrastive form of the loss is applied with a margin parameter, with which the Hamming distance of generated hash code pairs is expected to be less than the margin. Using relationship (1) between the Hamming distance and the inner product, the similarity loss can be redefined as:

(4)

Here the margin parameter induces the inner product of dissimilar pairs to fall below a margin-dependent bound, and that of similar pairs to exceed one. To enhance similarity preservation, we extend the similarity constraint so that it also ensures the discrimination of the deep semantic features. However, because of the gap between the feature distributions of LabNet and ImgNet, the inner product is no longer a plausible choice for evaluating similarity between the semantic features of the two networks, since the choice of the margin parameter becomes ambiguous. One way to resolve this is to equip both networks with the same activation function, such as sigmoid or tanh, at the output of the semantic layer to limit the output features to a fixed range; however, we want each network to keep its own scale of deep features. Since hash codes are discretized to either -1 or 1 at each bit and all generated hash codes have the same length, similarity evaluation in Hamming space depends on the angle between hash codes rather than on the absolute distance between them. This is why we adopt the cosine distance as a replacement:

(5)

Here the cosine of two vectors is their inner product divided by the product of their norms. Although pairwise label information is adopted to store the semantic similarity of hash codes, the label information is not fully exploited. LabNet therefore exploits semantic information through a classification task and a hashing task jointly. Many recent works directly map the learned binary codes into classification predictions through a linear classifier [28, 29]. To prevent interference between the classification stream and the hashing stream, and to avoid the classification performance being too sensitive to the length of the hash codes, both the hash layer and the classification layer serve as output layers in LabNet under a multi-task learning strategy [35, 36], as can be seen in Fig. 1.
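To make the cosine-based contrastive constraint concrete, the following sketch shows one plausible form of a per-pair loss. The exact expression of the loss is an assumption here: similar pairs are penalized when their cosine falls below the margin, and dissimilar pairs when their cosine rises above the negative margin. The names `cosine`, `pair_loss` and `margin` are chosen for this example only.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two vectors; invariant to their scale."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pair_loss(x, y, similar, margin):
    """Assumed contrastive form: hinge on the cosine with a margin."""
    c = cosine(x, y)
    if similar:
        return max(0.0, margin - c)   # similar pairs should reach cos >= margin
    return max(0.0, margin + c)       # dissimilar pairs should reach cos <= -margin

# The same pair of directions at very different scales gives the same cosine,
# which is why no common activation range is needed across the two networks.
small = [0.1, -0.2, 0.3]
large = [10.0, -20.0, 30.0]
print(cosine(small, large))
print(pair_loss(small, large, similar=True, margin=0.2))
print(pair_loss(small, large, similar=False, margin=0.2))
```

Under this assumed form, a margin of 0 only asks similar pairs to have a non-negative cosine and dissimilar pairs a non-positive one.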

The final objective function of LabNet can be formulated as:

(6)

The objective combines four terms: the similarity losses for the learned semantic features and for the hash codes, the classification loss, which measures the difference between the input labels and the predicted labels, and the quantization loss for the discretization of the learned hash codes.

3.2.2 Asymmetric learning strategy

In many self-supervised methods [62, 22], the self-supervised network is trained simultaneously with the feature learning network, with each image fed into ImgNet and its label fed into LabNet. Under such a mechanism, the two networks generate the same number of hash codes, and one set supervises the other through a pairwise similarity constraint. In this paper, we focus on making this form of guidance more efficient and more semantic-targeted. We are inspired by ADSH, where query points and database points are treated asymmetrically and the deep hash function is learned only for query points. As illustrated in Fig. 2, we use the converged LabNet to generate semantic features and hash codes with only the binary representation corresponding to each semantic label as input. This gives a hash code map and a corresponding semantic feature map, whose number of entries is determined by the total number of classes. With the adoption of cosine similarity, the scale of the semantic features is not constrained.
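The construction of the semantic maps can be sketched as follows. This is a toy illustration under stated assumptions: `toy_labnet` is a hypothetical stand-in for a converged LabNet (a single random linear layer with tanh, not the four-layer network used in the paper), and each semantic is assumed to be presented as a one-hot label vector. The point is only that the maps contain one entry per class rather than one per training image.

```python
import math
import random

def toy_labnet(label_vec, weights):
    """Hypothetical stand-in for a converged LabNet: linear layer + tanh.

    Returns a continuous semantic feature and its sign-discretized code.
    """
    feature = [math.tanh(sum(w * l for w, l in zip(row, label_vec)))
               for row in weights]
    code = [1 if f >= 0 else -1 for f in feature]
    return feature, code

def build_semantic_maps(num_classes, weights):
    """Feed each class once (as a one-hot vector) and store the outputs."""
    feature_map, code_map = [], []
    for c in range(num_classes):
        one_hot = [1.0 if j == c else 0.0 for j in range(num_classes)]
        feature, code = toy_labnet(one_hot, weights)
        feature_map.append(feature)
        code_map.append(code)
    return feature_map, code_map

num_classes, code_length = 5, 8          # arbitrary sizes for the example
rng = random.Random(0)
weights = [[rng.gauss(0.0, 1.0) for _ in range(num_classes)]
           for _ in range(code_length)]

feature_map, code_map = build_semantic_maps(num_classes, weights)
print(len(code_map), "codes of length", len(code_map[0]))
```

Since the maps are computed once after LabNet has converged, ImgNet can be guided by a small fixed set of targets instead of a second network that is run on every training sample.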

3.3 Feature learning network

We apply a convolutional neural network for image feature learning based on MLFN [37], which was originally designed for person re-identification. In MLFN, information from different semantic levels is extracted by FMs and fused with the final deep feature layer into our semantic layer. As in LabNet, we add an extra output layer and a hash layer to embed the high-dimensional semantic features into hash codes and classification predictions under a multi-task learning framework. The generation of both the semantic features and the hash codes of ImgNet is constrained by the semantic feature map and the semantic code map generated by LabNet through an asymmetric learning strategy, which gives the following asymmetric discrete loss:

(7)

3.3.1 Margin-scalable constraint

In most contrastive forms of pairwise or triplet similarity constraints used in deep hashing methods [38, 39, 32], the choice of the margin parameter relies mainly on manual tuning, and the margin is treated as a fixed constant for all item pairs. It can therefore be hypothesized that the performance of hash learning is highly sensitive to the choice of the margin parameter, which will be demonstrated in Section 4.3.2. Moreover, in multi-label scenarios, the expected margin for item pairs that share more semantic similarities should be larger, and vice versa, so a single fixed margin value may degrade the storage of pairwise information. To optimize the selection of the margin value, we propose a margin-scalable constraint based on the semantic maps generated by LabNet. Specifically, for two hash codes generated by ImgNet, a corresponding pair of semantic codes is obtained by looking up the semantic code map with the semantics the two items are labeled with. The scalable margin for the pair is calculated by:

(8)

Under this setting, every positive cosine distance between item pairs in the semantic code map is automatically assigned to the corresponding item pairs generated by ImgNet as their scalable margin, while negative cosine distances yield a margin of 0. This follows from the nature of multi-label tasks, where the "dissimilar" situation only refers to item pairs with no label in common, whereas for a random pair of similar items the number of shared labels may differ over a wide range. Thus, in pairwise similarity preservation, dissimilar items are given a weaker constraint, whereas similar pairs are constrained in a more precise and stronger way that accounts for the difference in similarity between specific item pairs.
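The margin rule described above can be sketched in a few lines. The function name `scalable_margin` and the toy semantic codes are assumptions for this example, and how the semantic code of a multi-label item is obtained from the map is not modeled here; the sketch only shows that a positive cosine between two semantic codes is used directly as the margin, while a negative cosine is clipped to 0.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def scalable_margin(sem_code_a, sem_code_b):
    """Margin for an item pair, taken from their looked-up semantic codes."""
    return max(0.0, cosine(sem_code_a, sem_code_b))

# Toy semantic codes standing in for entries of the semantic code map.
code_a = [1, 1, -1, -1]
code_b = [1, 1, 1, -1]     # agrees with code_a on three of four bits
code_c = [-1, -1, 1, 1]    # exact opposite of code_a

print(scalable_margin(code_a, code_b))   # closer semantics, larger margin
print(scalable_margin(code_a, code_c))   # opposed semantics, margin clipped to 0
```

With this rule no margin has to be tuned by hand for ImgNet, since each pair receives a margin that reflects how close its semantics are according to LabNet.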

Considering equations (7) and (8), the final objective function of ImgNet can be formulated as:

(9)

The objective combines six terms: the margin-scalable losses for the generated semantic features and for the hash codes, the asymmetric losses for the hash codes and for the semantic features, and the classification loss and quantization loss, which are defined in a way similar to those of LabNet.

3.4 Optimization

Note that ImgNet is trained after LabNet has converged. First, we iteratively optimize objective (6) by exploring multi-label information to learn the parameters of LabNet, and with the final trained LabNet we obtain the semantic feature map and the semantic code map. Then the parameters of LabNet are fixed, and the parameters of ImgNet are optimized on the training data under the guidance of the two semantic maps. Finally, we obtain the binary hash codes. The entire learning procedure is summarized in more detail in Algorithm 1.

3.4.1 Optimization of LabNet

The gradient of the LabNet objective with respect to each hash code in the sampled mini-batch is:

(10)

The remaining gradient terms can be obtained similarly, the gradients with respect to the network parameters can be computed using the chain rule, and the parameters can then be updated at each iteration using Adam with back-propagation.

3.4.2 Optimization of ImgNet

The gradient of the ImgNet objective with respect to each hash code in the sampled mini-batch is:

(11)

The remaining gradient terms can be obtained similarly, the gradients with respect to the network parameters can be computed using the chain rule, and the parameters can then be updated at each iteration using SGD with back-propagation.

Algorithm 1 The learning algorithm of our SADH
Input: image set, label set
Output: semantic feature map and semantic code map, parameters for ImgNet, optimal code matrix for ImgNet
Initialize the network parameters of LabNet and ImgNet
Hyper-parameters: mini-batch size, learning rate, maximum iteration numbers
Stage 1: Hash learning for the self-supervised network (LabNet)
for each iteration do
     Calculate the derivative using formula (10)
     Update the LabNet parameters by Adam and the BP algorithm
end for
Update the semantic feature map and semantic code map by LabNet, with each semantic as input
Stage 2: Hash learning for the feature learning network (ImgNet)
for each iteration do
     Calculate the derivative using formula (11)
     Update the ImgNet parameters by SGD and the BP algorithm
end for
Update the optimal code matrix for ImgNet

4 Experiments and analysis

In this section, we conduct extensive comparison experiments to examine three main aspects of our proposed SADH method: (1) the retrieval performance of SADH compared to existing state-of-the-art methods; (2) the improvement in efficiency of our method compared to other methods; (3) the effectiveness of the different modules proposed in our method.

4.1 Datasets and experimental settings

The evaluation is based on three mainstream image retrieval datasets: CIFAR-10[40], NUS-WIDE[15], MIRFlickr-25K[16].

CIFAR-10: CIFAR-10 contains 60,000 images with a resolution of 32 × 32. These images are divided into 10 categories of 6,000 images each. In the CIFAR-10 experiments, following [41], we select 100 images per category as the testing (query) set (1,000 in total) and use the remaining images as the database (59,000 in total); 500 database images per class are selected as the training set (5,000 in total).

NUS-WIDE: NUS-WIDE contains 269,648 images. This dataset is a multi-label image set with 81 ground-truth concepts. Following a protocol similar to [28, 41], we use the subset of 195,834 images annotated with the 21 most frequent classes (each category contains at least 5,000 images). Among them, 100 images and 500 images per class are randomly selected as the query set (2,100 in total) and the training set (10,500 in total) respectively. Excluding the query images, the remaining 193,734 images form the database. Two images are considered similar when they share at least one label, and dissimilar otherwise.

MIRFlickr-25K: The MIRFlickr-25K dataset consists of 25,000 images collected from the Flickr website. Each instance is annotated with one or more labels selected from 38 categories. We randomly select 1,000 images for the query set and 4,000 for the training set, and use the remaining images as the retrieval database. As with NUS-WIDE, two images that share at least one concept are considered similar; otherwise they are dissimilar.
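The similarity rule used for the multi-label datasets, where two images are similar when they share at least one label, can be written down directly. The sketch below builds the pairwise similarity matrix from multi-hot label vectors; the function name `similarity_matrix` and the three toy label vectors are assumptions for this example.

```python
def similarity_matrix(labels):
    """S[i][j] = 1 if items i and j share at least one label, else 0."""
    n = len(labels)
    return [[1 if any(a and b for a, b in zip(labels[i], labels[j])) else 0
             for j in range(n)]
            for i in range(n)]

# Three toy images annotated over three classes (multi-hot vectors).
labels = [
    [1, 0, 1],   # image 0: classes 0 and 2
    [0, 0, 1],   # image 1: class 2
    [0, 1, 0],   # image 2: class 1
]

S = similarity_matrix(labels)
for row in S:
    print(row)
```

Note that this binary rule treats a pair sharing one label the same as a pair sharing several, which is the loss of information that the margin-scalable constraint is meant to compensate for.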

We compare our proposed SADH with several state-of-the-art approaches including LSH [1], SH [8], ITQ [2], LFH [43], DSDH [28], HashNet [12], DPSH [9], DBDH [44], CSQ [63] and DSEH [29]. These methods are briefly introduced as follows:

1. Locality-Sensitive Hashing (LSH) [1] is a data-independent hashing method that employs random projections as the hash function.

2. Spectral Hashing (SH) [8] is a spectral method that transforms the original problem of finding the best hash codes for a given dataset into a graph partitioning task.

3. Iterative Quantization (ITQ) [2] is a classical unsupervised hashing method. It projects data points into a low-dimensional space using principal component analysis (PCA), then minimizes the quantization error for hash code learning.

4. Latent Factor Hashing (LFH) [43] is a supervised method based on latent hashing models, with a convergence guarantee and a linear-time variant.

5. Deep Supervised Discrete Hashing (DSDH) [28] is the first supervised deep hashing method that simultaneously utilizes both semantic labels and pairwise supervised information. The hash layer in DSDH is constrained to output binary codes directly.

6. HashNet [12] is a supervised deep architecture for hash code learning, which includes a smooth activation function to resolve the ill-posed gradient problem during training.

7. Deep Pairwise-Supervised Hashing (DPSH) [9] is a representative deep supervised hashing method that jointly performs feature learning and hash code learning for pairwise applications.

8. Deep Balanced Discrete Hashing (DBDH) [44] is a recent supervised deep hashing method that uses a straight-through estimator to realize discrete gradient propagation.

9. Central Similarity Quantization (CSQ) [63] defines the correlation of hash codes through a global similarity metric, identifying a common center for each pair of hash codes.

10. Deep Joint Semantic-Embedding Hashing (DSEH) [29] is a supervised deep hashing method that employs a self-supervised network to capture abundant semantic information as guidance for a feature learning network.

Among the above approaches, LSH [1], SH [8], ITQ [2] and LFH [43] are non-deep hashing methods. For these methods, 4096-dimensional deep features extracted from AlexNet [23] are used as inputs on two datasets, NUS-WIDE and CIFAR-10. The remaining six baselines (i.e., DSDH, HashNet, DPSH, DBDH, CSQ and DSEH) are deep hashing methods, for which the images of the three datasets (i.e., NUS-WIDE, CIFAR-10 and MIRFlickr-25K) are used as inputs. LSH, SH, ITQ, LFH, DSDH, HashNet and DPSH are run with the source code provided by their authors, while the remaining methods are carefully implemented by ourselves, with parameters following the suggestions of the original papers.

We evaluate the retrieval quality with three widely used metrics: Mean Average Precision (MAP), the Precision-Recall curve, and the Precision curve with the number of top returned results as the variable (topK-Precision).

Specifically, given a query instance $q$, the Average Precision (AP) is given by:

$AP(q) = \frac{1}{R} \sum_{k=1}^{N} P(k)\, \delta(k)$

where $N$ is the total number of instances in the database, $R$ is the number of similar samples, $P(k)$ is the precision of the retrieval results at cut-off $k$, i.e. the proportion of the top $k$ returned instances that are similar to the query, and $\delta(k)$ is the indicator function, with $\delta(k) = 1$ if the $k$-th retrieved instance is similar to $q$ and $\delta(k) = 0$ otherwise.

The larger the MAP, the better the retrieval performance. Since NUS-WIDE is relatively large, we only consider the top 5,000 neighbors when computing MAP for NUS-WIDE (MAP@5000), while for CIFAR-10 and MIRFlickr-25K we calculate MAP over the entire retrieval database (MAP@ALL).
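A minimal sketch of how AP and MAP with a cut-off can be computed from ranked relevance lists is given below. It assumes that each query's database items have already been sorted by increasing Hamming distance and marked 1 (similar) or 0 (dissimilar), and that the normalizer counts the similar items found within the considered ranking, which is one common convention in hashing evaluation code. The function names and toy rankings are chosen for this example.

```python
def average_precision(relevance, top_k=None):
    """AP of one ranked 0/1 relevance list, optionally cut at top_k."""
    ranked = relevance if top_k is None else relevance[:top_k]
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank   # precision at this cut-off
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_relevance, top_k=None):
    """Mean of the per-query AP values."""
    aps = [average_precision(r, top_k) for r in all_relevance]
    return sum(aps) / len(aps)

# Two toy queries; each list is the relevance of the ranked database items.
rankings = [
    [1, 0, 1, 0],
    [0, 1, 1, 1],
]
print(mean_average_precision(rankings))            # analogue of MAP@ALL
print(mean_average_precision(rankings, top_k=2))   # analogue of MAP@2
```

Passing `top_k=5000` corresponds to the MAP@5000 setting used for NUS-WIDE, and `top_k=None` to MAP@ALL.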

4.2 Implementation details

LabNet is built with four fully-connected layers, which transform the input labels into hash codes. The output includes both the hash code, whose dimension equals the code length, and the multi-label predictions, whose dimension equals the number of classes.

We build ImgNet on the MLFN network, keeping 16 blocks, each of which contains 32 FMs. The FSM is used to dynamically activate different FMs, so it is a 32-dimensional vector. The resulting FS has a dimension of 512 (32 FMs × 16 blocks). It feeds the semantic feature layer, which is followed by the output layer with one set of nodes for hash code generation and another for classification. Except for the output layer, the network is pre-trained on the ImageNet dataset.

Our method is implemented in the PyTorch framework and trained on an NVIDIA TITAN X GPU for 120 epochs. The four hyper-parameters of LabNet are set to 2, 0.5, 0.5 and 0.1 respectively. The six hyper-parameters of ImgNet are set to 0.01, 1, 0.01, 1, 2 and 0.05 respectively. As can be observed from Fig. 3, LabNet maintains robust retrieval performance under different choices of the margin parameter, especially small ones, hence we simply set m to 0 for all scenarios.

Besides, Adam [45] is applied to LabNet, while stochastic gradient descent (SGD) is applied to ImgNet. The batch size is set to 64. The learning rates are selected from a fixed range, with a momentum of 0.9.

4.3 Performance evaluation

Table 1: Examples of top 10 retrieved images by SADH and DSDH on MIRFlickr-25K for 48 bits, for three queries labeled "Portrait, Indoor, People", "Indoor, Night" and "Clouds, Sky" (retrieved images omitted). The semantically incorrect returned images are marked with a red border.

CIFAR-10 (MAP@ALL)
Method        16 bits   32 bits   48 bits   64 bits
LSH[1]        0.4443    0.5302    0.5839    0.6326
ITQ[2]        0.2094    0.2355    0.2424    0.2535
SH[8]         0.1866    0.1900    0.2044    0.2020
LFH[43]       0.1599    0.1608    0.1705    0.1693
DSDH[28]      0.7514    0.7579    0.7808    0.7690
HashNet[12]   0.6975    0.7821    0.8045    0.8128
DPSH[9]       0.7870    0.7807    0.7982    0.8003
DBDH[44]      0.7892    0.7803    0.7797    0.7914
CSQ[63]       0.7761    0.7775    -         0.7741
DSEH[29]      0.8025    0.8130    0.8214    0.8301
SADH          0.8755    0.8832    0.8913    0.8783
Table 2: MAP@ALL on CIFAR-10.

NUS-WIDE (MAP@5000)
Method        16 bits   32 bits   48 bits   64 bits
LSH[1]        0.4443    0.5302    0.5839    0.6326
ITQ[2]        0.2094    0.2355    0.2424    0.2535
SH[8]         0.1866    0.1900    0.2044    0.2020
LFH[43]       0.1599    0.1608    0.1705    0.1693
DSDH[28]      0.7941    0.8076    0.8318    0.8297
HashNet[12]   0.7554    0.8163    0.8340    0.8439
DPSH[9]       0.8094    0.8325    0.8441    0.8520
DBDH[44]      0.8052    0.8107    0.8277    0.8324
CSQ[63]       0.7853    0.8213    -         0.8316
DSEH[29]      0.7319    0.7466    0.7602    0.7721
SADH          0.8352    0.8454    0.8487    0.8503
Table 3: MAP@5000 on NUS-WIDE.

MIRFlickr-25K (MAP@ALL)
Method        16 bits   32 bits   48 bits   64 bits
DSDH[28]      0.7541    0.7574    0.7616    0.7680
HashNet[12]   0.7440    0.7685    0.7757    0.7815
DPSH[9]       0.7672    0.7694    0.7722    0.7772
DBDH[44]      0.7530    0.7615    0.7634    0.7653
CSQ[63]       0.6702    0.6735    -         0.6843
DSEH[29]      0.6832    0.6863    0.6974    0.6970
SADH          0.7731    0.7698    0.7993    0.7873
Table 4: MAP@ALL on MIRFlickr-25K.

4.3.1 Comparison to State of the Art

To validate the retrieval performance of our method, we compare the experimental results of SADH with other state-of-the-art methods, including LSH [1], SH [8], ITQ [2], LFH [43], DSDH [28], HashNet [12], DPSH [9], DBDH [44], CSQ [63] and DSEH [29], on CIFAR-10, NUS-WIDE and MIRFlickr-25K. Table 1 shows the top 10 images retrieved from the database for 3 sampled images in MIRFlickr-25K; in difficult cases, SADH shows better semantic consistency than DSDH. Tables 2 to 4 report the MAP results of the different methods; note that for NUS-WIDE, MAP is calculated within the top 5,000 returned neighbors. Figs. 6-11 show the overall retrieval performance of SADH and the comparison baselines in the form of precision-recall curves and of precision curves obtained by varying the number of top returned images from 1 to 1000, on NUS-WIDE, CIFAR-10 and MIRFlickr-25K respectively. SADH substantially outperforms the other state-of-the-art comparison methods. It is worth noting that SADH leads at almost all hash code lengths with steady performance across datasets. This is due to the multi-task learning structure of our method, in which the classification output and the hashing output are obtained independently so that the two tasks do not interfere with each other. It is also noteworthy that, with the abundant semantic information leveraged from the self-supervised network and the rich pairwise information derived from the margin-scalable constraint, SADH shows strong retrieval performance on both the single-label CIFAR-10 and the multi-label datasets (i.e., NUS-WIDE and MIRFlickr-25K).

4.3.2 Sensitivity to margin parameter

To test the earlier hypothesis about the two networks’ sensitivity to the margin parameter in the contrastive loss, we replace the scalable margin module in ImgNet with the margin constant used in LabNet and report the MAP of both networks at 48-bit code length under different choices of margin on CIFAR-10 and MIRFlickr-25K. As shown in Fig. 3, LabNet shows only slight changes in MAP under different choices of margin, while ImgNet is highly sensitive to the choice of margin, with a largest MAP gap of roughly 0.14 between margin = 0 and margin = 0.2 on CIFAR-10. To some extent, this reveals the significance of a proper selection of margin. It also supports the feasibility of calculating the margin for different item pairs from the hash codes generated by LabNet, since LabNet's performance is largely independent of the choice of margin parameter.
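To make the role of the margin concrete, the sketch below implements a generic cosine-based contrastive loss. It is an illustration under stated assumptions, not the loss defined earlier in this paper: the exact form, the function names, and the way a per-pair margin would be derived from LabNet codes are all hypothetical or left out. The margin may be passed either as one constant or as one value per pair, which mirrors the difference between a fixed margin and a scalable one.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero real-valued code vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_cosine_loss(pairs, margin):
    """Mean contrastive loss over pairs (u, v, s), with s = 1 for similar pairs.

    margin is either a single constant or a sequence holding one margin
    per pair (the "scalable" case; how such margins are obtained from
    LabNet codes is not modeled here).
    """
    if isinstance(margin, (list, tuple)):
        margins = list(margin)
    else:
        margins = [margin] * len(pairs)

    total = 0.0
    for (u, v, s), m in zip(pairs, margins):
        c = cosine(u, v)
        if s:
            total += 1.0 - c          # pull similar pairs toward cosine 1
        else:
            total += max(0.0, c - m)  # penalize dissimilar pairs only above the margin
    return total / len(pairs)
```

With a constant margin, every dissimilar pair is penalized against the same threshold. With per-pair margins, pairs that are only partly dissimilar can be given more slack, which is the intuition behind letting the margin scale with the pair.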

(a) CIFAR-10 (b) MIRFlickr-25K
Figure 3: Sensitivity analysis on margin parameter

4.4 Empirical analysis

Three additional experimental settings are designed to further analyze SADH.

4.4.1 Ablation study

We investigate the impact of the proposed modules on the retrieval performance of SADH using partly modified baselines. SADH-asm refers to ImgNet without asymmetric guidance from LabNet. SADH-mars is built by removing the margin-scalable constraint from ImgNet. SADH-cos replaces the cosine similarity module in both ImgNet and LabNet with the widely used logarithmic Maximum a Posteriori (MAP) estimation of pairwise similarity loss applied in many deep hashing approaches [29, 28], which takes the following form:

(12)
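As a reference point, the negative log-likelihood pairwise loss used in DPSH-style methods is commonly written as below. The symbols (s_ij for the pairwise similarity label, h_i for the code of item i, and Θ_ij for the scaled inner product) are assumed notation and may differ from the notation used elsewhere in this paper.

```latex
J = -\sum_{s_{ij} \in S} \left( s_{ij}\,\Theta_{ij} - \log\!\left(1 + e^{\Theta_{ij}}\right) \right),
\qquad \Theta_{ij} = \tfrac{1}{2}\, h_i^{\top} h_j
```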

Results are shown in Table 5 for both NUS-WIDE and CIFAR-10 at a code length of 32 bits. The asymmetric constraint with semantic information from LabNet plays an essential role in the performance of our method: removing it (SADH-asm) lowers MAP from 0.8454 to 0.7115 on NUS-WIDE and from 0.8832 to 0.7701 on CIFAR-10. The margin-scalable constraint in ImgNet itself also significantly improves retrieval accuracy. It can also be observed that cosine similarity achieves better performance than MAP estimation of pairwise similarity.

Figure 4: MAP during 50 epochs on CIFAR-10 and MIRFlickr-25K with different choices of margin.

As a further demonstration of the effectiveness of the margin-scalable constraint, we compare it with several choices of a single constant margin in SADH. Over 50 epochs, the top-5000 MAP results on MIRFlickr-25K and CIFAR-10 are reported every 10 epochs. As illustrated in Fig. 4, in both the single-labeled and the multi-labeled scenario, the scalable margin achieves better retrieval accuracy than fixed margin constants. Furthermore, on CIFAR-10 the scalable margin accelerates the convergence of SADH during training.

Methods NUS-WIDE (MAP@5000) CIFAR-10 (MAP@ALL)
SADH-asm 0.7115 0.7701
SADH-mars 0.8174 0.8249
SADH-cos 0.8168 0.8502
SADH 0.8454 0.8832
Table 5: Ablation study on several modules in SADH, with MAP on NUS-WIDE and CIFAR-10 at bit-length 32
Figure 5: Efficiency analysis compared with DSEH.
Figure 6: Precision-recall curves on NUS-WIDE.
Figure 7: TopK-precision curves on NUS-WIDE.
Figure 8: Precision-recall curves on CIFAR-10.
Figure 9: TopK-precision curves on CIFAR-10.
Figure 10: Precision-recall curves on MIRFlickr-25K.
Figure 11: TopK-precision curves on MIRFlickr-25K.

4.4.2 Training efficiency analysis

Fig. 5 shows the change in MAP with 32-bit hash codes over a training time of 1000 seconds, comparing SADH and DSEH on CIFAR-10. SADH reduces the training time needed to reach a MAP of 0.6 by approximately a factor of two, and it shows a tendency to converge earlier than DSEH. SADH thus achieves higher MAP than DSEH in less time. This is because in DSEH, ImgNet and LabNet are trained jointly for multiple rounds, with the hash codes and semantic features generated by ImgNet being supervised by the same number of those generated by LabNet. In SADH, by contrast, LabNet stops training after one round of convergence, and only the binary representations that belong to each semantic are input into LabNet to produce the hash code map and the semantic feature map. These maps supervise ImgNet training under the asymmetric learning strategy, allowing ImgNet to efficiently produce hash codes with discriminative representations and abundant pairwise information.

4.4.3 Visualization of hash codes

Fig. 12 shows the t-SNE [46] visualization of hash codes generated by DSDH and SADH on CIFAR-10, where hash codes that belong to the 10 different classes are assigned 10 different colors. Hash codes of different categories are discriminatively separated by SADH, while the hash codes generated by DSDH do not show this characteristic. This is because the cosine similarity and scalable margin mechanism used in SADH provide more accurate inter-class and intra-class similarity preservation, and thus more discriminative hash codes, than the pairwise similarity loss of form (12) used in DSDH.

(a) DSDH (b) SADH

Figure 12: The t-SNE visualization of hash codes learned by DSDH and SADH

5 Conclusion

In this paper, we present a novel self-supervised asymmetric deep hashing method with margin-scalable constraint, namely SADH, for large-scale image retrieval. SADH includes two networks. One is LabNet, which extracts abundant semantics and pairwise information from semantic labels in the form of a hash code map and a semantic feature map. These maps in turn constrain the other network, ImgNet, through an efficient asymmetric learning strategy, so that it generates discriminative hash codes with well-preserved similarities. Additionally, the cosine similarity measurement and the margin-scalable constraint are used to precisely and efficiently preserve similarity in the Hamming space. Comprehensive empirical evidence shows that SADH outperforms several state-of-the-art methods, including traditional methods and deep hashing methods, on three widely used benchmarks. In the future, we will explore applying the proposed margin-scalable constraint to other hashing methods such as cross-modal hashing and to real-world applications such as person re-identification.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (61806168), the Fundamental Research Funds for the Central Universities (SWU117059), and the Venture & Innovation Support Program for Chongqing Overseas Returnees (CX2018075).

References