Deep Momentum Uncertainty Hashing

09/17/2020 ∙ by Chaoyou Fu, et al. ∙ Horizon Robotics, Tsinghua University

Discrete optimization is one of the most intractable problems in deep hashing. Previous methods usually mitigate this problem by binary approximation, relaxing binary codes to continuous real values via activation functions or regularizations. However, such approximation leads to uncertainty between real values and binary ones, degrading retrieval performance. In this paper, we propose a novel Deep Momentum Uncertainty Hashing (DMUH). It explicitly estimates the uncertainty during training and leverages the uncertainty information to guide the approximation process. Specifically, we model bit-level uncertainty by measuring the discrepancy between the output of a hashing network and that of a momentum-updated network. The discrepancy of each bit indicates the uncertainty of the hashing network about the approximate output of that bit. Meanwhile, the mean discrepancy of all bits in a hashing code can be regarded as image-level uncertainty, which embodies the uncertainty of the hashing network about the corresponding input image. Hashing bits and images with higher uncertainty are paid more attention during optimization. To the best of our knowledge, this is the first work to study the uncertainty in hashing bits. Extensive experiments are conducted on four datasets to verify the superiority of our method, including CIFAR-10, NUS-WIDE, MS-COCO, and the million-scale dataset Clothing1M. Our method achieves the best performance on all datasets and surpasses existing state-of-the-art methods by a large margin, especially on Clothing1M.

Introduction

With the explosive growth of data in practical applications, hashing has received sustained attention due to its advantages in low storage cost and fast computation speed Jiang et al. (2019); Wang et al. (2018). Traditional hashing methods are based on elaborately designed hand-crafted features Lin et al. (2014a). The binary codes are learned from data distributions Gong and Lazebnik (2011) or obtained by random projection Gionis et al. (1999).

In recent years, with the thriving of deep learning, deep supervised hashing, which combines hashing with deep learning, has further improved retrieval performance Zhu and Gao (2017). Generally, the last layer of a neural network is leveraged to output binary hashing codes Fu et al. (2019). Early works, such as Convolutional Neural Network Hashing (CNNH) Xia et al. (2014), adopt a two-stage manner: the feature learning of the neural network and the hashing coding are separate. Subsequent works, e.g., Deep Pairwise-Supervised Hashing (DPSH) Li et al. (2016), perform feature learning and hashing coding in an end-to-end framework, which has shown better performance than the two-stage manner. For all deep hashing methods, an intractable problem is that the binary hashing code is discrete, which impedes the back-propagation of gradients in the neural network Jiang et al. (2018). How to solve the discrete optimization of binary codes remains a challenge.

Previous methods usually adopt binary approximation to tackle the above challenge. That is, the binary codes are replaced by continuous real values, which are enforced to be binary via non-linear activation functions Jiang et al. (2018). Nevertheless, the output of such an activation function, e.g., Sigmoid or Tanh, easily saturates, which inevitably slows down or even limits the training process Liu et al. (2016a). Considering the saturation problem, some recent methods discard the non-linear activation function and instead impose a regularization on the output to enforce the real value of each bit to be close to a binary one (+1 or -1) Li et al. (2016). However, these methods approximate all bits equally, ignoring their differences. As shown in Fig. 1, we discover that the approximate output of each bit has a unique change trend. It is obvious that the output of bit-1 changes more than the output of bit-2 during training. That is to say, the hashing network has higher uncertainty about the approximate output of bit-1. We call such per-bit uncertainty bit-level uncertainty. Furthermore, if all bits of a hashing code generally have high uncertainty, the hashing network has high uncertainty about the corresponding input image. We define the mean bit-level uncertainty of all bits in a hashing code as the image-level uncertainty. As can be seen from Fig. 1, images with high image-level uncertainty usually contain more complex scenarios, belonging to hard examples Wu et al. (2017).

In order to explicitly estimate the bit-level uncertainty, i.e., the change trends of bits, we need to compare current values with previous ones. A straightforward idea is to store the outputs of all training images at each optimization step and then compare the current outputs with them. Unfortunately, this is infeasible because of the huge memory required when training with large-scale datasets. Recently, in order to tackle the memory problem in unsupervised and semi-supervised learning Wu et al. (2018), He et al. (2020); Tarvainen and Valpola (2017); French et al. (2018) propose an extra momentum-updated network that averages model weights during training. The momentum-updated network is an ensemble of previous networks at different optimization steps, outputting ensemble results French et al. (2018). Inspired by this, a momentum-updated network is introduced to approximately obtain previous outputs. To the best of our knowledge, this is the first time a momentum-updated network has been introduced for uncertainty estimation. We further compare the outputs of the hashing network and the momentum-updated network, and regard the discrepancy as the bit-level uncertainty. According to the magnitude of the uncertainty, we set different regularization weights for different hashing bits. In addition, by averaging the uncertainty of all bits in a hashing code, we obtain the image-level uncertainty of the corresponding input image. Images with higher uncertainty are paid more attention during the optimization of the Hamming distance. The effectiveness of our method is demonstrated on four challenging datasets, including CIFAR-10 Krizhevsky and Hinton (2009), NUS-WIDE Chua et al. (2009), MS-COCO Lin et al. (2014b), and the million-scale dataset Clothing1M Xiao et al. (2015). In summary, the main contributions of our work are as follows:

  • We are the first to explore the uncertainty of hashing bits during approximate optimization. Depending on the magnitude of uncertainty, the corresponding hashing bits and input images receive different attention.

  • We propose to explicitly model bit-level and image-level uncertainty, resorting to the discrepancy between the output of the hashing network and that of the momentum-updated network.

  • Extensive experiments on the CIFAR-10, the NUS-WIDE, the MS-COCO, and the large-scale Clothing1M datasets show that our method significantly improves the retrieval performance when compared with state-of-the-art methods.

Related Work

Hashing Retrieval

Hashing aims at projecting data from a high-dimensional pixel space into a low-dimensional binary Hamming space. It has drawn substantial attention from researchers due to its low time and space complexity. Current hashing methods can be grouped into two categories: data-independent hashing methods and data-dependent hashing methods. For data-independent hashing, the binary hashing codes are generated by random projection or manually constructed, such as in locality sensitive hashing (LSH) Gionis et al. (1999). Since data-independent hashing usually requires long codes to guarantee retrieval performance, the more efficient data-dependent hashing, which learns hashing codes from data, has gained more attention in recent years Jiang and Li (2018).

Data-dependent hashing can be further divided into unsupervised hashing and supervised hashing, according to whether supervised similarity labels are used. Iterative quantization hashing (ITQ) Gong and Lazebnik (2011) and ordinal embedding hashing (OEH) Liu et al. (2016b) are representative unsupervised hashing methods. Both of them retrieve neighbors by exploring the metric structure of the data. Although unsupervised learning avoids the annotation demand of training data, exploiting the available supervisory information usually yields better performance. Representative supervised hashing methods based on hand-crafted features include supervised hashing with kernels (KSH) Liu et al. (2012), latent factor hashing (LFH) Zhang et al. (2014), and column-sampling based discrete supervised hashing (COSDISH) Kang et al. (2016), all of which achieve impressive results.

Benefiting from the powerful representation ability of deep neural networks, supervised hashing has made great progress in the last few years Lai et al. (2015). Deep learning based supervised hashing is called deep supervised hashing, which is a hot research direction in the machine learning and computer vision communities. Convolutional neural network hashing (CNNH) Xia et al. (2014) and deep pairwise-supervised hashing (DPSH) Li et al. (2016) are representative methods. Other recent works include Huang et al. (2019); Chen et al. (2019); Yang et al. (2019); Lin et al. (2019); Shen et al. (2020); Cui et al. (2020).

Uncertainty in Deep Learning

Here, uncertainty means the uncertainty of a deep neural network about its current outputs. In traditional deep learning, the network only outputs a deterministic result. However, in many scenarios, such as autonomous driving, we would like to simultaneously obtain the uncertainty of the network about that output, which facilitates reliability assessment and risk-based decision making Chang et al. (2020). Therefore, uncertainty has received much attention in recent years Gal and Ghahramani (2016); Kendall et al. (2015); Kendall and Gal (2017). Gal and Ghahramani (2016) develop an approximate Bayesian inference framework to represent model uncertainty, i.e., the uncertainty that exists in model parameters. Kendall and Gal (2017) propose to estimate model uncertainty and data uncertainty (which exists in the training data) in a unified framework. Although uncertainty has been widely explored in various tasks, including object detection Choi et al. (2019), semantic segmentation Kendall et al. (2015), domain adaptation Zheng and Yang (2020), face recognition Shi and Jain (2019), and 3D deformable learning Wu et al. (2020), these approaches are unsuitable for deep hashing because of its unique binary property.

Preliminaries

Notation

Uppercase boldface letters such as $\mathbf{B}$ are used to denote matrices, and lowercase letters such as $b_i$ are used to denote the $i$-th element of $\mathbf{b}$. $\mathbf{B}^\top$ indicates the transpose of the matrix $\mathbf{B}$. $\odot$ denotes the element-wise product of two vectors $\mathbf{u}$ and $\mathbf{v}$. $\mathrm{sign}(\cdot)$ means the element-wise sign function, which returns $+1$ and $-1$ when the element is positive and negative, respectively.

Problem Definition

Suppose there are $N$ images $X = \{x_i\}_{i=1}^{N}$, where $x_i$ denotes the $i$-th image. For deep supervised hashing, the pairwise similarity between two images is also available. The similarity information is denoted as $S = \{s_{ij}\}$ with $s_{ij} \in \{0, 1\}$. $s_{ij} = 1$ means $x_i$ and $x_j$ are similar, while $s_{ij} = 0$ means $x_i$ and $x_j$ are dissimilar.

The purpose of deep supervised hashing is to learn a function that maps the data from the high-dimensional pixel space to the low-dimensional binary Hamming space. That is, for each image $x_i$, we can get a binary hashing code $\mathbf{b}_i \in \{-1, +1\}^L$, where $L$ means that the code has $L$ bits. Meanwhile, the semantic similarity should be consistent before and after the mapping. For example, if $s_{ij} = 1$, $\mathbf{b}_i$ and $\mathbf{b}_j$ should have a shorter Hamming distance; otherwise, if $s_{ij} = 0$, $\mathbf{b}_i$ and $\mathbf{b}_j$ should have a longer Hamming distance. The Hamming distance of two binary codes is defined as:

$\mathrm{dist}_H(\mathbf{b}_i, \mathbf{b}_j) = \frac{1}{2}\left(L - \mathbf{b}_i^\top \mathbf{b}_j\right)$   (1)

It is obvious that the inner product of $\mathbf{b}_i$ and $\mathbf{b}_j$ can be used to measure the Hamming distance.
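The relation in Eq. (1) is easy to check numerically. Below is a minimal sketch (not from the authors' code; the helper name is ours) that computes the Hamming distance of two {-1, +1} codes from their inner product:

```python
# Minimal sketch of Eq. (1): dist_H(b_i, b_j) = (L - b_i^T b_j) / 2 for codes in {-1, +1}^L.
import torch

def hamming_distance(b_i: torch.Tensor, b_j: torch.Tensor) -> torch.Tensor:
    """Hamming distance between two {-1, +1} codes of length L."""
    L = b_i.numel()
    return 0.5 * (L - torch.dot(b_i.float(), b_j.float()))

b_i = torch.tensor([1.0, -1.0, 1.0, 1.0])
b_j = torch.tensor([1.0, 1.0, -1.0, 1.0])
print(hamming_distance(b_i, b_j))  # tensor(2.) -- the codes differ in two bits
```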

Figure 2: The framework of our method, which consists of a hashing network $F(\cdot; \theta)$ and a momentum-updated network $F'(\cdot; \theta')$. Apart from outputting approximate binary values as in previous works, our method also gives the uncertainty of the current approximate outputs.

Method

Overall Framework

As shown in Fig. 2, our framework consists of two networks: a hashing network $F(\cdot; \theta)$ and a momentum-updated network $F'(\cdot; \theta')$, where $\theta$ and $\theta'$ are the weights of the two networks, respectively. Moreover, the two networks have the same architecture: a backbone for feature learning as well as a fully-connected layer for approximate binary coding. The difference is that the weights $\theta$ are updated via the back-propagation of gradients, while the weights $\theta'$ are updated by averaging $\theta$.

Given an input image $x_i$, the two networks output approximate binary values $\mathbf{u}_i = F(x_i; \theta)$ and $\mathbf{v}_i = F'(x_i; \theta')$, respectively. Since $F'$ can be seen as an ensemble of $F$ French et al. (2018), we can approximately calculate the change of $\mathbf{u}_i$ by comparing $\mathbf{u}_i$ with $\mathbf{v}_i$. As mentioned in the Introduction, such change is regarded as the uncertainty of the hashing network about the current output value. For each bit, the larger the difference between $\mathbf{u}_i$ and $\mathbf{v}_i$, the more uncertain the hashing network is about that bit. By this means, we obtain the bit-level uncertainty. For instance, as shown in Fig. 2, the hashing network is more uncertain about the output of the third bit. In addition, by averaging the uncertainty values of all bits in a hashing code, we can obtain the image-level uncertainty, which represents the uncertainty of the hashing network about the corresponding input image. After obtaining the bit-level and image-level uncertainty, we set different attention for different bits and images during training. The detailed optimization is introduced in the following part.
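As a concrete illustration of this two-network setup, the following sketch builds a hashing network and its momentum-updated copy in Python. It is an assumption-laden sketch, not the authors' implementation: an AlexNet backbone from torchvision stands in for CNN-F, and the class and variable names are ours.

```python
import copy
import torch.nn as nn
import torchvision

class HashNet(nn.Module):
    """A backbone for feature learning plus a fully-connected layer for approximate binary coding."""
    def __init__(self, n_bits: int = 48):
        super().__init__()
        backbone = torchvision.models.alexnet()        # stand-in for CNN-F
        self.features = backbone.features
        self.avgpool = backbone.avgpool
        self.fc = nn.Sequential(*list(backbone.classifier.children())[:-1],
                                nn.Linear(4096, n_bits))

    def forward(self, x):
        x = self.avgpool(self.features(x)).flatten(1)
        return self.fc(x)                              # real-valued code u_i in R^L

hash_net = HashNet(48)                                 # F(.; theta), trained by back-propagation
momentum_net = copy.deepcopy(hash_net)                 # F'(.; theta'), updated by averaging theta
for p in momentum_net.parameters():
    p.requires_grad_(False)                            # no gradients flow into the momentum network
```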

Hashing Learning Revisit

Given the binary hashing codes $\mathbf{B} = \{\mathbf{b}_i\}_{i=1}^{N}$, the likelihood of the pairwise similarity $S$ is formulated as Zhang et al. (2014); Li et al. (2016):

$p(s_{ij} \mid \mathbf{b}_i, \mathbf{b}_j) = \begin{cases} \sigma(\Omega_{ij}), & s_{ij} = 1 \\ 1 - \sigma(\Omega_{ij}), & s_{ij} = 0 \end{cases}$   (2)

where $\Omega_{ij} = \frac{1}{2}\mathbf{b}_i^\top \mathbf{b}_j$ and $\sigma(x) = \frac{1}{1 + e^{-x}}$. Considering the negative log-likelihood of the pairwise similarity, the hashing codes can be optimized by Li et al. (2016):

$\mathcal{L} = -\sum_{s_{ij} \in S} \left( s_{ij}\,\Omega_{ij} - \log\left(1 + e^{\Omega_{ij}}\right) \right)$   (3)

Combined with Eq. (1), we can find that minimizing Eq. (3) makes similar image pairs have shorter Hamming distances and dissimilar image pairs have larger Hamming distances. Given that the discrete binary hashing codes are not differentiable, a usual solution is to replace the discrete binary codes with continuous real values. Then, a regularization term is imposed to enforce the real values to be close to binary ones Li et al. (2016):

$\mathcal{L} = -\sum_{s_{ij} \in S} \left( s_{ij}\,\Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right) + \eta \sum_{i=1}^{N} \left\| \mathbf{b}_i - \mathbf{u}_i \right\|_2^2$   (4)

where $\mathbf{u}_i$ are continuous real values, $\Theta_{ij} = \frac{1}{2}\mathbf{u}_i^\top \mathbf{u}_j$, $\mathbf{b}_i = \mathrm{sign}(\mathbf{u}_i)$, and $\eta$ is a hyper-parameter.
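As a reference point for what follows, here is a hedged sketch of the relaxed objective in Eqs. (3)-(4) for a mini-batch, assuming `u` holds the real-valued network outputs and `s` the pairwise similarity matrix (both names are ours; in practice the sum runs only over supervised pairs):

```python
import torch
import torch.nn.functional as F

def relaxed_pairwise_loss(u: torch.Tensor, s: torch.Tensor, eta: float = 50.0) -> torch.Tensor:
    """u: (B, L) real-valued codes; s: (B, B) pairwise similarity in {0, 1}."""
    theta = 0.5 * u @ u.t()                       # Theta_ij = 0.5 * u_i^T u_j
    pair_term = -(s * theta - F.softplus(theta))  # negative log-likelihood; softplus(x) = log(1 + e^x)
    b = torch.sign(u).detach()                    # b_i = sign(u_i), treated as a constant target
    reg_term = ((b - u) ** 2).sum(dim=1)          # ||b_i - u_i||_2^2
    return pair_term.sum() + eta * reg_term.sum()
```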

Discussion:

Obviously, Eq. (4) can be used to learn hashing codes: the first term optimizes the distance in the Hamming space, and the second term constrains the real values to approximate the binary codes. However, Eq. (4) treats all hashing bits and input images equally, without considering their differences. As shown in Fig. 1, the hashing network has different uncertainty about different hashing bits and input images. Therefore, we argue that each hashing bit and each input image should be treated separately according to the magnitude of its uncertainty, rather than equally.

Uncertainty Estimation

While training the hashing network, the output real value of each hashing bit usually changes to minimize the objective function in Eq. (4). Intuitively, if the real value of one bit always changes a lot during the optimization, the hashing network has high uncertainty about that bit. A naive way to measure the change is to store the output of each bit and then compare the current output with the previous one. However, this is infeasible because of the huge memory required when training with large-scale datasets.

Inspired by the recently proposed momentum models in unsupervised and semi-supervised learning He et al. (2020); Tarvainen and Valpola (2017); French et al. (2018), we introduce a momentum-updated network $F'$ to estimate the uncertainty. Different from the hashing network $F$, which updates its weights $\theta$ via gradient back-propagation, $F'$ updates $\theta'$ by averaging $\theta$:

$\theta' \leftarrow m\,\theta' + (1 - m)\,\theta$   (5)

where $m$ is a momentum coefficient hyper-parameter whose value controls the smoothness of $\theta'$; a larger $m$ results in a smoother $\theta'$. Such an optimization manner can be seen as assembling the hashing networks at different optimization steps into the momentum-updated network French et al. (2018). Therefore, by comparing the output of the hashing network with that of the momentum-updated network, we can approximately obtain the change of each bit during training. We regard this change as the uncertainty. That is, if a bit changes a lot, the hashing network has high uncertainty about the current approximate value, and vice versa. Formally, the uncertainty is defined as:

$\mathbf{c}_i = \left| \mathbf{u}_i - \mathbf{v}_i \right|$   (6)

where $|\cdot|$ is an element-wise absolute value operation. $\mathbf{c}_i$ is a vector, and each element represents the bit-level uncertainty of the corresponding hashing bit. Furthermore, by averaging the uncertainty of all bits in a hashing code, we can obtain the uncertainty of the input image corresponding to that hashing code:

$d_i = \frac{1}{L} \sum_{k=1}^{L} c_{ik}$   (7)

where the image-level uncertainty $d_i$ is a single value instead of a vector.
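The momentum update and the two uncertainty estimates translate directly into code. The sketch below assumes the `hash_net`/`momentum_net` pair defined earlier and uses our own helper names; it is an illustration of Eqs. (5)-(7), not the authors' released code.

```python
import torch

@torch.no_grad()
def momentum_update(hash_net, momentum_net, m: float = 0.9):
    """Eq. (5): theta' <- m * theta' + (1 - m) * theta."""
    for p, p_m in zip(hash_net.parameters(), momentum_net.parameters()):
        p_m.mul_(m).add_(p, alpha=1.0 - m)

def estimate_uncertainty(hash_net, momentum_net, x):
    """Returns the two outputs plus bit-level (Eq. 6) and image-level (Eq. 7) uncertainty."""
    u = hash_net(x)                              # u_i: output of the hashing network F
    with torch.no_grad():
        v = momentum_net(x)                      # v_i: output of the ensemble F'
    c = (u - v).abs()                            # Eq. (6): bit-level uncertainty, shape (B, L)
    d = c.mean(dim=1)                            # Eq. (7): image-level uncertainty, shape (B,)
    return u, v, c, d
```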

Discussion:

Fig. 1 plots the approximate real values of two bits at different epochs during training. It is obvious that the two bits have different change trends. After 40 epochs, bit-1 still changes sharply, while bit-2 changes only slightly. In addition, the uncertainty values of the two bits calculated through Eq. (6) are 0.073 and 0.005, respectively. We can see that the magnitude of the uncertainty is consistent with the degree of change of the approximate real value. That is, bit-1 has the larger uncertainty and correspondingly the more drastic value change. Therefore, it is reasonable to leverage the discrepancy between the output of the hashing network and that of the momentum-updated network to represent the uncertainty. Finally, Fig. 1 also shows images with different image-level uncertainty. We can find that the images with low uncertainty usually have clear objects and simple backgrounds, while the images with high uncertainty contain more complex scenes. For instance, it is difficult to recognize the frog in the first image in the bottom right corner. These phenomena confirm the relationship between the image-level uncertainty and the input images.

Uncertainty-aware Hashing Learning

After obtaining the bit-level uncertainty $\mathbf{c}_i$, we leverage it to guide the optimization of the regularization. Rather than treating each bit equally as in Eq. (4), we set different weights for different bits according to the magnitude of their uncertainty, yielding a new optimization objective:

$\mathcal{L} = -\sum_{s_{ij} \in S} \left( s_{ij}\,\Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right) + \eta \sum_{i=1}^{N} \left\| \mathbf{c}_i \odot \left( \mathbf{b}_i - \mathbf{u}_i \right) \right\|_2^2$   (8)

where $\mathbf{c}_i$ is multiplied as a weight on the regularization term. A hashing bit with higher uncertainty is given a larger weight during regularization. In addition, the image-level uncertainty allows us to set different weights for different input images: we apply larger weights to images with higher uncertainty in the optimization of the Hamming distance. Considering both the uncertainties $d_i$ and $d_j$ of images $x_i$ and $x_j$, Eq. (8) is reformulated as:

$\mathcal{L} = -\sum_{s_{ij} \in S} \left( d_i + d_j \right) \left( s_{ij}\,\Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right) + \eta \sum_{i=1}^{N} \left\| \mathbf{c}_i \odot \left( \mathbf{b}_i - \mathbf{u}_i \right) \right\|_2^2$   (9)

Finally, we also incorporate the uncertainty into the optimization objective:

$\mathcal{L} = -\sum_{s_{ij} \in S} \left( d_i + d_j \right) \left( s_{ij}\,\Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right) + \eta \sum_{i=1}^{N} \left\| \mathbf{c}_i \odot \left( \mathbf{b}_i - \mathbf{u}_i \right) \right\|_2^2 + \gamma \sum_{i=1}^{N} \left\| \mathbf{u}_i - \mathbf{v}_i \right\|_2^2$   (10)

where $\gamma$ is a trade-off parameter. The whole optimization process for the hashing network and the momentum-updated network is summarized in Algorithm 1.
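Putting the three terms together, the following sketch implements one possible reading of Eq. (10) for a mini-batch. The pairwise weight (d_i + d_j) and the detaching of the uncertainty weights are our assumptions; the function and variable names are ours as well.

```python
import torch
import torch.nn.functional as F

def dmuh_loss(u, v, c, d, s, eta: float = 50.0, gamma: float = 1.0) -> torch.Tensor:
    """u, v: (B, L) outputs of F and F'; c: (B, L) bit-level uncertainty;
    d: (B,) image-level uncertainty; s: (B, B) pairwise similarity in {0, 1}."""
    theta = 0.5 * u @ u.t()
    pair_w = d.unsqueeze(1) + d.unsqueeze(0)            # weight (d_i + d_j) for each pair
    pair_term = (pair_w.detach() * -(s * theta - F.softplus(theta))).sum()
    b = torch.sign(u).detach()
    reg_term = ((c.detach() * (b - u)) ** 2).sum()      # bit-wise weighted regularization
    cons_term = ((u - v) ** 2).sum()                    # pull F toward the ensemble F'
    return pair_term + eta * reg_term + gamma * cons_term
```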

Discussion:

What are the advantages of uncertainty-aware hashing learning? To begin with, hard examples can be discovered automatically according to the magnitude of the image-level uncertainty. Benefiting from this, the first term of Eq. (10) can focus on the optimization of hard examples; the effectiveness of such hard-example-based approaches has been demonstrated in previous works Wu et al. (2017). Furthermore, the second term of Eq. (10) assists in stabilizing the outputs of the bits that change frequently, which may accelerate the convergence of the hashing network. Finally, the third term of Eq. (10) minimizes the discrepancy between the outputs of the hashing network and those of the momentum-updated network. Since the momentum-updated network is actually an ensemble of the hashing networks at different optimization steps, the third term of Eq. (10) helps to improve the performance of the hashing network French et al. (2018).

Input:
Training set $X$, semantic similarity $S$
Output:
The weights $\theta$ of the hashing network and the weights $\theta'$ of the momentum-updated network
REPEAT

   Randomly sample a batch of training data with pairwise similarity supervision;
   Compute the outputs $\mathbf{u}_i$ of the hashing network and $\mathbf{v}_i$ of the momentum-updated network;
   Compute the bit-level uncertainty $\mathbf{c}_i$ and the image-level uncertainty $d_i$ according to Eq. (6) and Eq. (7), respectively;
   Update $\theta$ according to Eq. (10) with standard gradient back-propagation;
   Update $\theta'$ according to Eq. (5);

UNTIL a fixed number of iterations

Algorithm 1 Optimization Algorithm
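For completeness, a sketch of Algorithm 1 as a training loop is given below. It reuses the helpers sketched earlier (`hash_net`, `momentum_net`, `estimate_uncertainty`, `dmuh_loss`, `momentum_update`) and a hypothetical `loader` yielding (images, labels); the single-label similarity construction is an assumption matching the evaluation protocol described later.

```python
import torch

def make_similarity(labels: torch.Tensor) -> torch.Tensor:
    """s_ij = 1 if two single-label images come from the same class, else 0."""
    return (labels.unsqueeze(1) == labels.unsqueeze(0)).float()

def train(hash_net, momentum_net, loader, epochs: int = 150, lr: float = 0.05, m: float = 0.9):
    optimizer = torch.optim.SGD(hash_net.parameters(), lr=lr, weight_decay=1e-4)
    for _ in range(epochs):
        for images, labels in loader:
            s = make_similarity(labels)                       # pairwise similarity supervision
            u, v, c, d = estimate_uncertainty(hash_net, momentum_net, images)
            loss = dmuh_loss(u, v, c, d, s)                   # Eq. (10)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                                  # update theta by back-propagation
            momentum_update(hash_net, momentum_net, m)        # update theta' by Eq. (5)
```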

Experiments

Datasets and Protocols

Four datasets are used to evaluate our proposed method, including two single-label datasets CIFAR-10 Krizhevsky and Hinton (2009) and Clothing1M Xiao et al. (2015), as well as two multi-label datasets NUS-WIDE Chua et al. (2009) and MS-COCO Lin et al. (2014b).

Cifar-10.

It consists of 60,000 color images with a 32 × 32 resolution, belonging to 10 classes (6,000 images per class). Following Li et al. (2016), we randomly sample 1,000 images as the query set and then randomly select 5,000 images from the remaining images as the training set.

Clothing1M.

It is a million-scale dataset with a total of 1,037,497 images. The data division, including the query set, the training set, and the database set, is consistent with Jiang et al. (2018). This is the largest and most challenging dataset used in deep hashing.

Nus-Wide.

It contains 269,648 images collected from the Flickr website. Each image belongs to one or multiple classes among 81 classes. Following Lai et al. (2015), only the 195,834 images associated with the 21 most frequent classes are adopted in our experiments. The division of our query set (2,100 images) and training set (10,500 images) is the same as in Li et al. (2016).

Ms-Coco.

It has 82,783 training images and 40,504 validation images. Each image belongs to one or multiple classes among 91 classes. Following the setting of Jiang and Li (2018), 5,000 images and 10,000 images are randomly selected as the query set and the training set, respectively.

Evaluation Methodology.

Following Lai et al. (2015), Mean Average Precision (MAP) is adopted to evaluate the retrieval performance. In particular, for NUS-WIDE, the MAP is calculated within the top 5,000 returned neighbors. For the single-label CIFAR-10 and Clothing1M datasets, two images are treated as a similar pair ($s_{ij} = 1$) when they come from the same class; otherwise they are regarded as a dissimilar pair. For the multi-label NUS-WIDE and MS-COCO datasets, two images are considered similar if they share at least one common label.
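For readers who want to reproduce the protocol, the sketch below computes MAP from a Hamming ranking; the function and argument names are ours, and the top-k truncation corresponds to the top-5,000 setting used for NUS-WIDE.

```python
import torch

def mean_average_precision(query_codes, db_codes, sim, topk=None):
    """query_codes: (Q, L) and db_codes: (N, L), float tensors in {-1, +1};
    sim: (Q, N) ground-truth similarity in {0, 1}."""
    L = query_codes.shape[1]
    aps = []
    for q in range(query_codes.shape[0]):
        dist = 0.5 * (L - db_codes @ query_codes[q])       # Eq. (1) against the whole database
        order = torch.argsort(dist)                        # rank database items by Hamming distance
        rel = sim[q][order]
        if topk is not None:
            rel = rel[:topk]                               # e.g. top 5,000 for NUS-WIDE
        n_rel = rel.sum()
        if n_rel == 0:
            continue                                       # no relevant item retrieved for this query
        ranks = torch.arange(1, rel.numel() + 1, dtype=torch.float32)
        prec = torch.cumsum(rel, dim=0) / ranks            # precision at each rank
        aps.append((prec * rel).sum() / n_rel)
    return torch.stack(aps).mean()
```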

Experimental Details

For a fair comparison with other state-of-the-art methods, we employ the CNN-F network Chatfield et al. (2014) pre-trained on ImageNet Russakovsky et al. (2015) as the backbone of both our hashing network and the momentum-updated network. The classification layer of CNN-F is replaced with a fully-connected layer whose length equals that of the hashing codes (12, 24, 32, and 48 bits, respectively). Stochastic Gradient Descent (SGD) is used as the optimizer with a weight decay of 1e-4. The initial learning rate is set to 0.05 and gradually reduced to 0.0005. The batch size is set to 128. The momentum coefficient hyper-parameter $m$ in Eq. (5) is set to 0.9, and the hyper-parameters $\eta$ and $\gamma$ in Eq. (10) are set to 50 and 1, respectively. We determine the specific values of $\eta$ and $\gamma$ by balancing the magnitudes of the corresponding loss terms. All experiments are conducted on a single NVIDIA TITAN RTX GPU.
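A hedged sketch of this optimization setup is shown below, reusing the `hash_net` defined earlier; the exact learning-rate schedule is not specified beyond its endpoints, so an exponential decay over an assumed 150 epochs is used for illustration.

```python
import torch

optimizer = torch.optim.SGD(hash_net.parameters(), lr=0.05, weight_decay=1e-4)  # batch size 128 handled by the data loader
# Decay the learning rate smoothly from 0.05 toward 0.0005 over an assumed 150 epochs.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=(0.0005 / 0.05) ** (1 / 150))
```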

Evaluation of the Uncertainty-aware Hashing

In this subsection, we compare our proposed method against the traditional regularization based method (denoted as Regu), whose optimization objective is Eq. (4). The only difference between our method and Regu is the introduced uncertainty, including its estimation and usage.

Fig. 3 (a) plots the training losses of the two methods over epochs on the CIFAR-10 dataset, and Fig. 3 (b) shows the corresponding MAP values on the training set. Observing the results, we can see that the loss curve of our method converges after about 50 epochs; meanwhile, the MAP value reaches its peak and remains stable. In contrast, the loss curve of Regu converges much more slowly. In particular, the MAP curve of Regu still oscillates sharply at the 90-th epoch. The faster convergence of our method may be because we pay more attention to the hashing bits with drastic value changes (according to the bit-level uncertainty) and to the hard examples (according to the image-level uncertainty). In addition, although the two methods achieve similar MAP values on the training set, our method obtains a much better MAP result (0.815) than Regu (0.739) on the test set. This further shows the excellent generalization performance of our uncertainty-based method.

Figure 3: The loss and the MAP curves during training on the CIFAR-10 dataset.

Ablation Study

In this subsection, we compare our method against three variants to reveal the role of each component. The first variant removes the optimization of the uncertainty, i.e., the third term in Eq. (10); in this case, we no longer minimize the discrepancy between the hashing network and the momentum-updated network. The second variant discards the bit-level uncertainty in the second term of Eq. (10), which means that we ignore the differences among hashing bits. The third variant removes the image-level uncertainty in the first term of Eq. (10), so that all training images are treated equally during training.

Following Jiang and Li (2018), we report Top-5K precision curves to measure the retrieval performance on the CIFAR-10, NUS-WIDE, and MS-COCO datasets. The comparison results are reported in Fig. 4. It is obvious that our method obtains the best retrieval performance. The improvement over the first variant suggests the impact of minimizing the uncertainty, which transfers knowledge from the momentum-updated network to the hashing network Tarvainen and Valpola (2017). The gain over the second variant demonstrates that the bit-level uncertainty is effective: the bits with drastic value changes are given larger weights to stabilize their outputs. The improvement over the third variant proves the validity of the image-level uncertainty, which enables hard examples to receive more attention during optimization.

Furthermore, we also provide detailed parameter analyses. Table 1 reports the study of the trade-off parameters $\eta$ and $\gamma$ in Eq. (10), as well as the momentum coefficient $m$ in Eq. (5). From Table 1 (a) and (b), we observe that our method is not sensitive to $\eta$ and $\gamma$ over a large range. For example, the MAP value at 24 bits changes by only 0.007 when $\eta$ is varied from 30 to 70. Table 1 (c) suggests that the most appropriate value of $m$ is 0.7.

(a) Trade-off parameter $\eta$
$\eta$    24 bits   32 bits
30        0.808     0.815
40        0.810     0.818
50        0.815     0.822
60        0.811     0.819
70        0.809     0.817

(b) Trade-off parameter $\gamma$
$\gamma$  24 bits   32 bits
0.2       0.806     0.817
0.5       0.810     0.819
1         0.815     0.822
2         0.809     0.814
3         0.807     0.810

(c) Momentum coefficient $m$
$m$       24 bits   32 bits
0.5       0.807     0.813
0.6       0.810     0.817
0.7       0.815     0.822
0.8       0.813     0.819
0.9       0.810     0.815

Table 1: Parameter analyses on the CIFAR-10 dataset, including $\eta$ and $\gamma$ in Eq. (10), and $m$ in Eq. (5).
Method     | CIFAR-10 (12/24/32/48 bits)    | NUS-WIDE (12/24/32/48 bits)    | MS-COCO (12/24/32/48 bits)     | Clothing1M (12/24/32/48 bits)
DMUH       | 0.772 / 0.815 / 0.822 / 0.826  | 0.792 / 0.818 / 0.825 / 0.829  | 0.761 / 0.779 / 0.785 / 0.788  | 0.315 / 0.371 / 0.389 / 0.401
DDSH       | 0.753 / 0.776 / 0.803 / 0.811  | 0.776 / 0.803 / 0.810 / 0.817  | 0.745 / 0.765 / 0.771 / 0.774  | 0.271 / 0.332 / 0.343 / 0.346
DSDH       | 0.740 / 0.774 / 0.792 / 0.813  | 0.774 / 0.801 / 0.813 / 0.819  | 0.743 / 0.762 / 0.765 / 0.769  | 0.278 / 0.302 / 0.311 / 0.319
DPSH       | 0.712 / 0.725 / 0.742 / 0.752  | 0.768 / 0.793 / 0.807 / 0.812  | 0.741 / 0.759 / 0.763 / 0.771  | 0.193 / 0.204 / 0.213 / 0.215
DSH        | 0.644 / 0.742 / 0.770 / 0.799  | 0.712 / 0.731 / 0.740 / 0.748  | 0.696 / 0.717 / 0.715 / 0.722  | 0.173 / 0.187 / 0.191 / 0.202
DHN        | 0.680 / 0.721 / 0.723 / 0.733  | 0.771 / 0.801 / 0.805 / 0.814  | 0.744 / 0.765 / 0.769 / 0.774  | 0.190 / 0.224 / 0.212 / 0.248
COSDISH    | 0.583 / 0.661 / 0.680 / 0.701  | 0.642 / 0.740 / 0.784 / 0.796  | 0.689 / 0.692 / 0.731 / 0.758  | 0.187 / 0.235 / 0.256 / 0.275
SDH        | 0.453 / 0.633 / 0.651 / 0.660  | 0.764 / 0.799 / 0.801 / 0.812  | 0.695 / 0.707 / 0.711 / 0.716  | 0.151 / 0.186 / 0.194 / 0.197
FastH      | 0.597 / 0.663 / 0.684 / 0.702  | 0.726 / 0.769 / 0.781 / 0.803  | 0.719 / 0.747 / 0.754 / 0.760  | 0.173 / 0.206 / 0.216 / 0.244
LFH        | 0.417 / 0.573 / 0.641 / 0.692  | 0.711 / 0.768 / 0.794 / 0.813  | 0.708 / 0.738 / 0.758 / 0.772  | 0.154 / 0.159 / 0.212 / 0.257
ITQ        | 0.261 / 0.275 / 0.286 / 0.294  | 0.714 / 0.736 / 0.745 / 0.755  | 0.633 / 0.632 / 0.630 / 0.633  | 0.115 / 0.121 / 0.122 / 0.125
Table 2: MAP of different methods on the CIFAR-10, NUS-WIDE, MS-COCO, and Clothing1M datasets. For the NUS-WIDE dataset, the MAP is calculated within the top 5,000 returned neighbors.
Figure 4: Top-5K precision on the CIFAR-10, the NUS-WIDE, and the MS-COCO datasets.

Comparisons with the State-of-the-arts

Our method is compared with several state-of-the-art hashing methods covering three learning paradigms. For the unsupervised paradigm, the compared method is iterative quantization (ITQ) Gong and Lazebnik (2011). For the non-deep supervised paradigm, the compared methods include column-sampling based discrete supervised hashing (COSDISH) Kang et al. (2016), supervised discrete hashing (SDH) Shen et al. (2015), fast supervised hashing (FastH) Lin et al. (2014a), and latent factor hashing (LFH) Zhang et al. (2014). For the deep supervised paradigm, the compared methods consist of deep supervised discrete hashing (DSDH) Li et al. (2020), deep discrete supervised hashing (DDSH) Jiang et al. (2018), deep pairwise-supervised hashing (DPSH) Li et al. (2016), deep supervised hashing (DSH) Liu et al. (2016a), and deep hashing network (DHN) Zhu et al. (2016).

The comparison results of our method against the above state-of-the-art methods are tabulated in Table 2, from which we make three observations. First, the unsupervised method ITQ lags behind all the supervised methods, showing the advantage of supervised labels. Second, the performance of deep supervised hashing methods is generally better than that of the non-deep supervised hashing methods, which indicates that the features extracted by deep neural networks are better than hand-crafted features. Third, our method achieves the highest retrieval accuracy on all datasets. For example, on the CIFAR-10 dataset, our method surpasses the state-of-the-art DDSH by 3.9% at 24 bits. On the NUS-WIDE dataset, we improve the best MAP values of all bit lengths by at least 1.2%. On the MS-COCO dataset, we also obtain a 1.6% improvement at 12 bits. Notably, on the large-scale Clothing1M dataset, the MAP value is improved by 3.7%, 3.9%, 4.6%, and 5.5% at 12, 24, 32, and 48 bits, respectively. The compared state-of-the-art deep hashing methods adopt similar binary approximations that treat all hashing bits equally, such as DSDH using Tanh and DPSH using the regularization. Given this, we attribute the gains of our method over the competitors to the proposed uncertainty-aware learning manner, which applies different weights to different hashing bits and input images.

Conclusion

In this paper, we have proposed an uncertainty-aware deep supervised hashing method named DMUH. To begin with, we discover that the hashing network has different uncertainty about different approximate binary hashing bits. Accordingly, we propose that hashing bits should be paid disparate attention during training, rather than being treated equally. In addition, we introduce a momentum-updated network to assist in estimating such uncertainty, including both bit-level uncertainty and image-level uncertainty. The former is utilized to guide the regularization of hashing bits, and the latter is leveraged to assist in optimizing the Hamming distance. Extensive experiments on four datasets demonstrate the superiority of our proposed method, especially on the million-scale Clothing1M dataset.

References