Introduction
With the explosive growth of data in practical applications, hashing has received sustained attention due to its low storage cost and fast computation speed Jiang et al. (2019); Wang et al. (2018). Traditional hashing methods are based on elaborately designed handcrafted features Lin et al. (2014a). The binary codes are learned from data distributions Gong and Lazebnik (2011) or obtained by random projection Gionis et al. (1999). In recent years, with the rise of deep learning, deep supervised hashing, which combines hashing with deep learning, has further improved retrieval performance Zhu and Gao (2017). Generally, the last layer of a neural network is leveraged to output binary hashing codes Fu et al. (2019). Early works, such as Convolutional Neural Network Hashing (CNNH) Xia et al. (2014), adopt a two-stage manner in which the feature learning of the neural network and the hash coding are separate. Subsequent works, e.g., Deep Pairwise-Supervised Hashing (DPSH) Li et al. (2016), perform feature learning and hash coding in an end-to-end framework, which has shown better performance than the two-stage manner. For all deep hashing methods, an intractable problem is that the binary hashing code is discrete, which impedes the backpropagation of gradients in the neural network Jiang et al. (2018). How to solve the discrete optimization of the binary code remains a challenge.
Previous methods usually adopt binary approximation to tackle this challenge. That is, the binary codes are replaced by continuous real values, which are enforced to be binary via nonlinear activation functions Jiang et al. (2018). Nevertheless, the output of an activation function such as Sigmoid or Tanh saturates easily, which inevitably slows down or even limits the training process Liu et al. (2016a). To avoid saturation, some recent methods drop the nonlinear activation function and impose a regularization on the output to enforce the real value of each bit to be close to a binary one (+1 or -1) Li et al. (2016). However, these methods approximate all bits equally and ignore their differences. As shown in Fig. 1, we discover that the approximate output of each bit has a unique change trend. The output of bit-1 changes more than the output of bit-2 during training. That is to say, the hashing network has higher uncertainty about the approximate output of bit-1. We call such per-bit uncertainty the bit-level uncertainty. Furthermore, if all bits of a hashing code generally have high uncertainty, the hashing network has high uncertainty about the corresponding input image. We define the mean bit-level uncertainty of all bits in a hashing code as the image-level uncertainty. As can be seen from Fig. 1, the images with high image-level uncertainty usually contain more complex scenes and belong to hard examples Wu et al. (2017).
In order to explicitly estimate the bit-level uncertainty, i.e., the change trends of bits, we need to compare current values with previous ones. A straightforward idea is to store the outputs of all training images at each optimization step and then compare the current outputs with them. Unfortunately, this is infeasible because of the huge memory required when training with large-scale datasets. Recently, in order to tackle the memory problem in unsupervised and semi-supervised learning Wu et al. (2018), He et al. (2020); Tarvainen and Valpola (2017); French et al. (2018) propose an extra momentum-updated network that averages model weights during training. The momentum-updated network is an ensemble of previous networks at different optimization steps and outputs ensemble results French et al. (2018). Inspired by this, we introduce a momentum-updated network to approximately obtain previous outputs. As far as we know, this is the first time a momentum-updated network has been introduced for uncertainty estimation. We further compare the outputs of the hashing network and the momentum-updated network, and regard the discrepancy as the bit-level uncertainty. According to the magnitude of the uncertainty, we set different regularization weights for different hashing bits. In addition, by averaging the uncertainty of all bits in a hashing code, we get the image-level uncertainty of the corresponding input image. Images with higher uncertainty receive more attention during the optimization of Hamming distance. The effectiveness of our method is demonstrated on four challenging datasets: CIFAR-10 Krizhevsky and Hinton (2009), NUS-WIDE Chua et al. (2009), MS-COCO Lin et al. (2014b), and the million-scale dataset Clothing1M Xiao et al. (2015). In summary, the main contributions of our work are as follows:
We are the first to explore the uncertainty of hashing bits during approximate optimization. Depending on the magnitude of uncertainty, the corresponding hashing bits and input images receive different attention.

We propose to explicitly model bit-level and image-level uncertainty through the discrepancy between the output of the hashing network and that of the momentum-updated network.

Extensive experiments on the CIFAR-10, NUS-WIDE, MS-COCO, and large-scale Clothing1M datasets show that our method significantly improves retrieval performance compared with state-of-the-art methods.
Related Work
Hashing Retrieval
Hashing aims at projecting data from the high-dimensional pixel space into the low-dimensional binary Hamming space. It has drawn substantial attention from researchers due to its low time and space complexity. Current hashing methods can be grouped into two categories: data-independent and data-dependent. For data-independent hashing, the binary hashing codes are generated by random projection or constructed manually, as in locality sensitive hashing (LSH) Gionis et al. (1999). Since data-independent hashing usually requires long codes to guarantee retrieval performance, the more efficient data-dependent hashing, which learns hashing codes from data, has gained more attention in recent years Jiang and Li (2018).
Data-dependent hashing can be further divided into unsupervised hashing and supervised hashing, according to whether supervised similarity labels are used. Iterative quantization hashing (ITQ) Gong and Lazebnik (2011) and ordinal embedding hashing (OEH) Liu et al. (2016b) are representative unsupervised hashing methods. Both retrieve neighbors by exploring the metric structure in the data. Though unsupervised learning avoids the need to annotate training data, exploiting the available supervisory information usually yields better performance. Representative supervised hashing methods based on handcrafted features include supervised hashing with kernels (KSH) Liu et al. (2012), latent factor hashing (LFH) Zhang et al. (2014), and column-sampling based discrete supervised hashing (COSDISH) Kang et al. (2016), all of which achieve impressive results.
Benefiting from the powerful representation ability of deep neural networks, supervised hashing has made great progress in the last few years Lai et al. (2015). Supervised hashing based on deep learning is called deep supervised hashing, which is an active research direction in the machine learning and computer vision communities. Convolutional neural network hashing (CNNH) Xia et al. (2014) and deep pairwise-supervised hashing (DPSH) Li et al. (2016) are representative methods. Other recent works include Huang et al. (2019); Chen et al. (2019); Yang et al. (2019); Lin et al. (2019); Shen et al. (2020); Cui et al. (2020).
Uncertainty in Deep Learning
Here, uncertainty means the uncertainty of the deep neural network about its current outputs. In traditional deep learning, the network only outputs a deterministic result. However, in many scenarios, such as autonomous driving, we would also like to obtain the uncertainty of the network about that output, which facilitates reliability assessment and risk-based decisions Chang et al. (2020). Therefore, uncertainty has received much attention in recent years Gal and Ghahramani (2016); Kendall et al. (2015); Kendall and Gal (2017). Gal and Ghahramani (2016) develop an approximate Bayesian inference framework to represent model uncertainty, i.e., the uncertainty in the model parameters. Kendall and Gal (2017) propose to estimate model uncertainty and data uncertainty (the uncertainty in the training data) in a unified framework. Although uncertainty has been widely explored in various tasks, including object detection Choi et al. (2019), semantic segmentation Kendall et al. (2015), domain adaptation Zheng and Yang (2020); Shi and Jain (2019), and 3D deformable learning Wu et al. (2020), these approaches are unsuitable for deep hashing because of its unique binary property.
Preliminaries
Notation
Uppercase letters such as $B$ are used to denote matrices, and lowercase letters like $b_{ij}$ are used to denote the $(i,j)$-th element of $B$. $B^\top$ indicates the transpose of the matrix $B$. $\odot$ denotes the element-wise product of two vectors. sign($\cdot$) means the element-wise sign function, which returns $+1$ and $-1$ when the element is positive and negative, respectively.
Problem Definition
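Suppose there are $n$ images $X = \{\mathbf{x}_i\}_{i=1}^{n}$, where $\mathbf{x}_i$ denotes the $i$-th image. For deep supervised hashing, the pairwise similarity between two images is also available. The similarity information is denoted as $S = \{s_{ij}\}$ with $s_{ij} \in \{0, 1\}$. $s_{ij} = 1$ means $\mathbf{x}_i$ and $\mathbf{x}_j$ are similar, while $s_{ij} = 0$ means they are dissimilar.
The purpose of deep supervised hashing is to learn a function that maps the data from the high-dimensional pixel space to the low-dimensional binary Hamming space. That is, for each image $\mathbf{x}_i$, we can get a binary hashing code $\mathbf{b}_i \in \{-1, +1\}^{K}$, where $K$ is the number of bits of the code. Meanwhile, the semantic similarity should be consistent before and after mapping. For example, if $s_{ij} = 1$, $\mathbf{b}_i$ and $\mathbf{b}_j$ should have a shorter Hamming distance. Otherwise, if $s_{ij} = 0$, $\mathbf{b}_i$ and $\mathbf{b}_j$ should have a longer Hamming distance. The Hamming distance of two binary codes is defined as:
$$\mathrm{dist}_H(\mathbf{b}_i, \mathbf{b}_j) = \frac{1}{2}\left(K - \mathbf{b}_i^\top \mathbf{b}_j\right) \tag{1}$$
Hence the inner product of $\mathbf{b}_i$ and $\mathbf{b}_j$ can be used to measure the Hamming distance: a larger inner product corresponds to a smaller distance.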
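The relation between the inner product and the Hamming distance can be checked numerically. The following is a minimal sketch, assuming codes are stored as Python lists with entries in {-1, +1}; the function names are illustrative and not from the paper. It compares a direct count of differing bits with the standard identity (K - inner product) / 2.

```python
def hamming_distance(b_i, b_j):
    """Count the positions at which two {-1, +1} codes differ."""
    return sum(1 for x, y in zip(b_i, b_j) if x != y)


def hamming_from_inner_product(b_i, b_j):
    """Same distance via the identity (K - <b_i, b_j>) / 2."""
    k = len(b_i)
    inner = sum(x * y for x, y in zip(b_i, b_j))
    return (k - inner) // 2  # K - inner is always even for +/-1 codes


code_a = [1, -1, 1, 1]
code_b = [1, 1, -1, 1]
print(hamming_distance(code_a, code_b), hamming_from_inner_product(code_a, code_b))
```

Each matching bit contributes +1 to the inner product and each differing bit contributes -1, which is why the two functions agree.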
Method
Overall Framework
As shown in Fig. 2, our framework consists of two networks: a hashing network and a momentum-updated network, whose weights are denoted by $\theta_h$ and $\theta_m$, respectively. The two networks have the same architecture: a backbone for feature learning and a fully-connected layer for approximate binary coding. The difference is that the weights $\theta_h$ are updated via the backpropagation of gradients, while the weights $\theta_m$ are updated by averaging $\theta_h$.
Given an input image $\mathbf{x}_i$, the two networks output approximate binary values $\mathbf{u}_i$ and $\tilde{\mathbf{u}}_i$, respectively. Since the momentum-updated network can be seen as an ensemble of the hashing network over optimization steps French et al. (2018), we can approximately calculate the change of $\mathbf{u}_i$ by comparing it with $\tilde{\mathbf{u}}_i$. As mentioned in the Introduction, this change is regarded as the uncertainty of the hashing network about the current output value. For each bit, the larger the difference between $\mathbf{u}_i$ and $\tilde{\mathbf{u}}_i$, the more uncertain the hashing network is about that bit. In this way, we get the bit-level uncertainty. For instance, as shown in Fig. 2, the hashing network is more uncertain about the output of the third bit. In addition, by averaging the uncertainty values of all bits in a hashing code, we obtain the image-level uncertainty, which represents the uncertainty of the hashing network about the corresponding input image. After obtaining the bit-level and the image-level uncertainty, we pay different attention to different bits and images during training. The detailed optimization is introduced in the following part.
Hashing Learning Revisit
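Given the binary hashing codes $B = \{\mathbf{b}_i\}_{i=1}^{n}$, the likelihood of the pairwise similarity is formulated as Zhang et al. (2014); Li et al. (2016):
$$p(s_{ij} \mid B) = \begin{cases} \sigma(\Theta_{ij}), & s_{ij} = 1 \\ 1 - \sigma(\Theta_{ij}), & s_{ij} = 0 \end{cases} \tag{2}$$
where $\Theta_{ij} = \frac{1}{2}\mathbf{b}_i^\top \mathbf{b}_j$ and $\sigma(\Theta_{ij}) = \frac{1}{1 + e^{-\Theta_{ij}}}$. Considering the negative log-likelihood of the pairwise similarity, the hashing codes can be optimized by Li et al. (2016):
$$\min_{B} \mathcal{L} = -\log p(S \mid B) = -\sum_{s_{ij} \in S} \left( s_{ij}\Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right) \tag{3}$$
Combined with Eq. (1), we can see that minimizing Eq. (3) makes similar image pairs have a shorter Hamming distance and dissimilar image pairs have a longer Hamming distance. Given that the discrete binary hashing codes are not differentiable, a usual solution is to replace the discrete binary codes with continuous real values. Then, a regularization term is imposed to enforce the real values to be close to binary ones Li et al. (2016):
$$\min \mathcal{L} = -\sum_{s_{ij} \in S} \left( s_{ij}\Psi_{ij} - \log\left(1 + e^{\Psi_{ij}}\right) \right) + \eta \sum_{i=1}^{n} \left\| \mathbf{b}_i - \mathbf{u}_i \right\|_2^2 \tag{4}$$
where $\mathbf{u}_i$ are continuous real values, $\Psi_{ij} = \frac{1}{2}\mathbf{u}_i^\top \mathbf{u}_j$, $\mathbf{b}_i = \mathrm{sign}(\mathbf{u}_i)$, and $\eta$ is a hyperparameter.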
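To make the two terms of the relaxed objective concrete, here is a minimal pure-Python sketch of a pairwise negative log-likelihood plus a quantization regularizer of the DPSH type. It is an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary format for the similarity labels, and the tie-breaking rule sign(0) = +1 are all assumptions made here.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sign(v):
    """Element-wise sign; mapping 0 to +1 is an assumption of this sketch."""
    return [1.0 if x >= 0 else -1.0 for x in v]


def pairwise_hashing_loss(u, s, eta):
    """u: list of real-valued codes, s: {(i, j): 0 or 1}, eta: regularization weight."""
    likelihood = 0.0
    for (i, j), s_ij in s.items():
        psi = 0.5 * sum(a * b for a, b in zip(u[i], u[j]))
        likelihood -= s_ij * psi - softplus(psi)  # negative log-likelihood of one pair
    # Squared distance between each real-valued code and its binarized version.
    quant = sum((b - a) ** 2 for u_i in u for a, b in zip(u_i, sign(u_i)))
    return likelihood + eta * quant


codes = [[1.0, 1.0], [1.0, 1.0]]
print(pairwise_hashing_loss(codes, {(0, 1): 1}, eta=50.0))
print(pairwise_hashing_loss(codes, {(0, 1): 0}, eta=50.0))
```

With identical codes, the pair labelled similar yields a lower loss than the same pair labelled dissimilar, and the regularizer vanishes because the codes are already binary.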
Discussion:
Eq. (4) can be used to learn hashing codes: the first term optimizes the distance in Hamming space and the second term constrains the real values to approximate the binary codes. However, Eq. (4) treats all hashing bits and input images equally, without considering their differences. As shown in Fig. 1, the hashing network has different uncertainty about different hashing bits and input images. Therefore, we argue that each hashing bit and each input image should be treated according to the magnitude of its uncertainty, rather than treated equally.
Uncertainty Estimation
While training the hashing network, the output real value of each hashing bit usually changes to minimize the objective function Eq. (4). Intuitively, if the real value of one bit keeps changing substantially during optimization, the hashing network has high uncertainty about that bit. A naive way to measure the change is to store the output of each bit and then compare the current output with the previous one. However, this is infeasible because of the huge memory required when training with large-scale datasets.
Inspired by the recently proposed momentum model in unsupervised and semi-supervised learning He et al. (2020); Tarvainen and Valpola (2017); French et al. (2018), we introduce a momentum-updated network to estimate the uncertainty. Different from the hashing network, which updates its weights $\theta_h$ via gradient backpropagation, the momentum-updated network updates $\theta_m$ by averaging $\theta_h$:
$$\theta_m \leftarrow \alpha\,\theta_m + (1 - \alpha)\,\theta_h \tag{5}$$
where $\alpha$ is a momentum coefficient hyperparameter, whose value controls the smoothness of $\theta_m$. A larger $\alpha$ results in a smoother $\theta_m$. Such an update can be seen as assembling the hashing networks at different optimization steps into the momentum-updated network French et al. (2018). Therefore, by comparing the output of the hashing network with that of the momentum-updated network, we can approximately obtain the change of each bit during training. We regard this change as the uncertainty. That is, if a bit changes a lot, the hashing network has high uncertainty about the current approximate value, and vice versa. Formally, the uncertainty is defined as:
$$\mathbf{c}_i = \left| \mathbf{u}_i - \tilde{\mathbf{u}}_i \right| \tag{6}$$
where $|\cdot|$ is an element-wise absolute value operation, and $\mathbf{u}_i$ and $\tilde{\mathbf{u}}_i$ are the outputs of the hashing network and the momentum-updated network for image $\mathbf{x}_i$. $\mathbf{c}_i$ is a vector and each element represents the bit-level uncertainty of the corresponding hashing bit. Furthermore, by averaging the uncertainty of all bits in a hashing code, we can obtain the uncertainty of the input image corresponding to that hashing code:
$$\bar{c}_i = \frac{1}{K} \sum_{k=1}^{K} c_{ik} \tag{7}$$
where $c_{ik}$ is the $k$-th element of $\mathbf{c}_i$, and the image-level uncertainty $\bar{c}_i$ is a single value instead of a vector.
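The three steps described above (momentum update of the weights, bit-level discrepancy, image-level average) can be sketched in a few lines. This is a minimal illustration, assuming the momentum update is the usual exponential moving average with coefficient alpha and that weights and outputs are flat lists of floats; the function names are hypothetical and not taken from the paper.

```python
def momentum_update(theta_m, theta_h, alpha):
    """Exponential moving average of the hashing-network weights (assumed form)."""
    return [alpha * m + (1.0 - alpha) * h for m, h in zip(theta_m, theta_h)]


def bit_level_uncertainty(u, u_tilde):
    """Element-wise absolute difference between the two networks' outputs."""
    return [abs(a - b) for a, b in zip(u, u_tilde)]


def image_level_uncertainty(bit_unc):
    """Mean of the bit-level uncertainties of one hashing code."""
    return sum(bit_unc) / len(bit_unc)


theta_m = momentum_update([0.0, 1.0], [1.0, 1.0], alpha=0.9)
bits = bit_level_uncertainty([0.9, -0.2, 0.5, -1.0], [0.8, 0.4, 0.5, -0.9])
print(theta_m, bits, image_level_uncertainty(bits))
```

In this toy example the second bit disagrees most between the two outputs, so it receives the largest bit-level uncertainty, and no per-image history of outputs has to be stored.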
Discussion:
Fig. 1 plots the approximate real values of two bits at different epochs during training. The two bits have different change trends. After 40 epochs, bit-1 still changes sharply, while bit-2 changes only slightly. In addition, the uncertainty values of the two bits calculated through Eq. (6) are 0.073 and 0.005, respectively. The magnitude of the uncertainty is thus consistent with the degree of change of the approximate real value: bit-1 has the larger uncertainty and correspondingly the more drastic value change. Therefore, it is reasonable to use the discrepancy between the output of the hashing network and that of the momentum-updated network to represent the uncertainty. Finally, Fig. 1 also shows images with different image-level uncertainty. The images with low uncertainty usually have clear objects and simple backgrounds, while the images with high uncertainty contain more complex scenes. For instance, it is difficult to recognize the frog in the first image in the bottom right corner. These observations suggest a relationship between the image-level uncertainty and the content of the input images.
Uncertainty-aware Hashing Learning
After getting the bit-level uncertainty $\mathbf{c}_i$, we leverage it to guide the optimization of the regularization. Rather than treating each bit equally as in Eq. (4), we set different weights for different bits according to the magnitude of the uncertainty, yielding a new optimization objective:
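$$\min \mathcal{L} = -\sum_{s_{ij} \in S} \left( s_{ij}\Psi_{ij} - \log\left(1 + e^{\Psi_{ij}}\right) \right) + \eta \sum_{i=1}^{n} \left\| \mathbf{c}_i \odot \left( \mathbf{b}_i - \mathbf{u}_i \right) \right\|_2^2 \tag{8}$$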
where the bit-level uncertainty $\mathbf{c}_i$ is multiplied as a weight on the regularization term. A hashing bit with higher uncertainty is given a larger weight during regularization. In addition, the image-level uncertainty allows us to set different weights for different input images. We apply larger weights to the images with higher uncertainty in the optimization of Hamming distance. Considering the image-level uncertainties $\bar{c}_i$ and $\bar{c}_j$ of images $\mathbf{x}_i$ and $\mathbf{x}_j$, Eq. (8) is reformulated as:
(9) 
Finally, we also incorporate the uncertainty itself into the optimization objective:
(10)  
where $\lambda$ is a trade-off parameter. The whole optimization process for the hashing network and the momentum-updated network is summarized in Algorithm 1.
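The structure of the uncertainty-aware objective (a pair term weighted by image-level uncertainty, a regularizer weighted by bit-level uncertainty, and an uncertainty term with a trade-off parameter) can be sketched as follows. The exact weighting functions are not recoverable from the text here, so this sketch makes loud assumptions: the pair weight `1 + c_i + c_j`, the per-bit weight applied inside the squared regularizer, and the use of the summed image-level uncertainty as the third term are all hypothetical choices for illustration, as are the function and argument names.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def uncertainty_aware_loss(u, u_tilde, s, eta, lam):
    """u / u_tilde: outputs of the hashing / momentum-updated network per image."""
    k = len(u[0])
    bit_unc = [[abs(a - t) for a, t in zip(u_i, t_i)] for u_i, t_i in zip(u, u_tilde)]
    img_unc = [sum(c_i) / k for c_i in bit_unc]

    pair_term = 0.0
    for (i, j), s_ij in s.items():
        psi = 0.5 * sum(a * b for a, b in zip(u[i], u[j]))
        weight = 1.0 + img_unc[i] + img_unc[j]  # ASSUMED form of the image-level weight
        pair_term -= weight * (s_ij * psi - softplus(psi))

    reg_term = 0.0
    for u_i, c_i in zip(u, bit_unc):
        for a, c in zip(u_i, c_i):
            b = 1.0 if a >= 0 else -1.0
            reg_term += (c * (b - a)) ** 2  # ASSUMED: bit-level weight inside the square

    unc_term = sum(img_unc)  # ASSUMED form of the uncertainty-minimizing term
    return pair_term + eta * reg_term + lam * unc_term


u = [[1.0, 1.0], [1.0, 1.0]]
print(uncertainty_aware_loss(u, u, {(0, 1): 1}, eta=50.0, lam=1.0))
print(uncertainty_aware_loss(u, [[0.5, 1.0], [1.0, 1.0]], {(0, 1): 1}, eta=50.0, lam=1.0))
```

When the two networks agree exactly, all uncertainties are zero and only the plain pair term remains; any disagreement increases both the pair weight and the uncertainty term.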
Discussion:
What are the advantages of uncertainty-aware hashing learning? To begin with, hard examples can be discovered automatically according to the magnitude of the image-level uncertainty. Benefiting from this, the first term of Eq. (10) can focus on the optimization of hard examples. The effectiveness of such hard-example-based approaches has been demonstrated in previous works Wu et al. (2017). Furthermore, the second term of Eq. (10) helps stabilize the outputs of the bits that change frequently, which may accelerate the convergence of the hashing network. Finally, the third term of Eq. (10) minimizes the discrepancy between the outputs of the hashing network and those of the momentum-updated network. Since the momentum-updated network is effectively an ensemble of the hashing networks at different optimization steps, the third term of Eq. (10) helps improve the performance of the hashing network French et al. (2018).
Experiments
Datasets and Protocols
Four datasets are used to evaluate our proposed method, including two single-label datasets, CIFAR-10 Krizhevsky and Hinton (2009) and Clothing1M Xiao et al. (2015), as well as two multi-label datasets, NUS-WIDE Chua et al. (2009) and MS-COCO Lin et al. (2014b).
CIFAR-10.
It consists of 60,000 color images with 32 × 32 resolution, belonging to 10 classes (6,000 images per class). Following Li et al. (2016), we randomly sample 1,000 images as the query set and then randomly select 5,000 images from the remaining images as the training set.
Clothing1M.
It is a million-scale dataset with a total of 1,037,497 images. The data division, including the query set, the training set, and the database set, is consistent with Jiang et al. (2018). This is the largest and most challenging dataset used in deep hashing.
NUS-WIDE.
It contains 269,648 images collected from the Flickr website. Each image belongs to one or more of 81 classes. Following Lai et al. (2015), only the 195,834 images associated with the 21 most frequent classes are adopted in our experiments. The division into query set (2,100 images) and training set (10,500 images) is the same as in Li et al. (2016).
MS-COCO.
It has 82,783 training images and 40,504 validation images. Each image belongs to one or more of 91 classes. Following the setting of Jiang and Li (2018), 5,000 images and 10,000 images are randomly selected as the query set and the training set, respectively.
Evaluation Methodology.
Following Lai et al. (2015), Mean Average Precision (MAP) is adopted to evaluate the retrieval performance. In particular, for NUS-WIDE, the MAP is calculated within the top 5,000 returned neighbors. For the single-label CIFAR-10 and Clothing1M datasets, two images are treated as a similar pair ($s_{ij} = 1$) when they come from the same class; otherwise they are regarded as a dissimilar pair. For the multi-label NUS-WIDE and MS-COCO datasets, two images are considered similar if they share at least one common label.
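The evaluation protocol above can be illustrated with a small sketch of average precision under Hamming ranking. It is a minimal example under stated assumptions, not the evaluation code used in the paper: labels are represented as Python sets so that "share at least one common label" covers both the single-label and multi-label cases, the optional `top_k` argument mimics the top-5,000 cutoff, and AP is normalized by the number of relevant items retrieved (other conventions exist).

```python
def average_precision(query_code, query_labels, db_codes, db_labels, top_k=None):
    """Average precision of one query under Hamming ranking of the database."""
    def dist(a, b):
        return sum(1 for x, y in zip(a, b) if x != y)

    # Rank database items by Hamming distance to the query (stable for ties).
    order = sorted(range(len(db_codes)), key=lambda idx: dist(query_code, db_codes[idx]))
    if top_k is not None:
        order = order[:top_k]

    hits, precision_sum = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if query_labels & db_labels[idx]:  # similar if at least one label is shared
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


db_codes = [[1, 1, 1], [1, 1, -1], [1, -1, -1], [-1, -1, -1]]
db_labels = [{0}, {1}, {0, 2}, {3}]
ap = average_precision([1, 1, 1], {0}, db_codes, db_labels)
print(ap)
```

MAP is then the mean of this quantity over all queries in the query set.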
Experimental Details
For a fair comparison with other state-of-the-art methods, we employ the CNN-F network Chatfield et al. (2014) pretrained on ImageNet Russakovsky et al. (2015) as the backbone of our hashing network and the momentum-updated network. The classification layer of CNN-F is replaced by a fully connected layer whose output size equals the length of the hashing codes (12, 24, 32, or 48 bits). Stochastic Gradient Descent (SGD) is used as the optimizer with a weight decay of 1e-4. The initial learning rate is set to 0.05 and gradually reduced to 0.0005. The batch size is set to 128. The momentum coefficient hyperparameter $\alpha$ in Eq. (5) is set to 0.9, and the hyperparameters $\eta$ and $\lambda$ in Eq. (10) are set to 50 and 1, respectively. We determine the specific values of $\eta$ and $\lambda$ by balancing the magnitudes of the corresponding loss terms. All experiments are conducted on a single NVIDIA TITAN RTX GPU.
Evaluation of the Uncertainty-aware Hashing
In this subsection, we compare our proposed method against the traditional regularization-based method (denoted as Regu), whose optimization objective is Eq. (4). The only difference between our method and Regu is the introduced uncertainty, including its estimation and usage.
Fig. 3 (a) plots the training losses of the two methods over epochs on the CIFAR-10 dataset. Fig. 3 (b) shows the corresponding MAP values on the training set. The loss curve of our method converges after about 50 epochs, at which point the MAP value reaches its peak and remains stable. In contrast, the loss curve of Regu converges much more slowly, and its MAP curve still oscillates sharply at the 90th epoch. The faster convergence of our method may be because we pay more attention to the hashing bits with drastic value changes (according to the bit-level uncertainty) and to the hard examples (according to the image-level uncertainty). In addition, although the two methods have similar MAP values on the training set, our method achieves a much better MAP (0.815) than Regu (0.739) on the testing set, which indicates better generalization of our uncertainty-based method.
Ablation Study
In this subsection, we compare our method against three variants to reveal the role of each component. The first variant removes the optimization of uncertainty, i.e., the third term in Eq. (10). In this case, we no longer minimize the discrepancy between the hashing network and the momentum-updated network. The second variant discards the bit-level uncertainty in the second term of Eq. (10), which means that we ignore the differences among hashing bits. The third variant removes the image-level uncertainty in the first term of Eq. (10), so that all training images are treated equally during training.
Following Jiang and Li (2018), we report Top-5K precision curves to measure the retrieval performance on the CIFAR-10, NUS-WIDE, and MS-COCO datasets. The comparison results are reported in Fig. 4. Our full method obtains the best retrieval performance. The improvements over the first variant suggest the impact of uncertainty minimization, which transfers knowledge from the momentum-updated network to the hashing network Tarvainen and Valpola (2017). The gains over the second variant demonstrate that the bit-level uncertainty is effective: the bits with drastic value changes are given larger weights to stabilize their outputs. The improvements over the third variant support the validity of the image-level uncertainty, which enables hard examples to receive more attention during optimization.
Furthermore, we also give detailed parameter analyses. Table 1 reports the parameter study of the trade-off parameters $\eta$ and $\lambda$ in Eq. (10), and the momentum coefficient $\alpha$ in Eq. (5). From Table 1 (a) and (b), we observe that our method is not sensitive to $\eta$ and $\lambda$ over a large range. For example, the MAP value at 24 bits only changes by 0.007 when $\eta$ is varied from 30 to 70. Table 1 (c) suggests that the most appropriate value of $\alpha$ is 0.7.



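Method  CIFAR-10 (12 bits)  CIFAR-10 (24 bits)  CIFAR-10 (32 bits)  CIFAR-10 (48 bits)  NUS-WIDE (12 bits)  NUS-WIDE (24 bits)  NUS-WIDE (32 bits)  NUS-WIDE (48 bits)  MS-COCO (12 bits)  MS-COCO (24 bits)  MS-COCO (32 bits)  MS-COCO (48 bits)  Clothing1M (12 bits)  Clothing1M (24 bits)  Clothing1M (32 bits)  Clothing1M (48 bits)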
DMUH  0.772  0.815  0.822  0.826  0.792  0.818  0.825  0.829  0.761  0.779  0.785  0.788  0.315  0.371  0.389  0.401 
DDSH  0.753  0.776  0.803  0.811  0.776  0.803  0.810  0.817  0.745  0.765  0.771  0.774  0.271  0.332  0.343  0.346 
DSDH  0.740  0.774  0.792  0.813  0.774  0.801  0.813  0.819  0.743  0.762  0.765  0.769  0.278  0.302  0.311  0.319 
DPSH  0.712  0.725  0.742  0.752  0.768  0.793  0.807  0.812  0.741  0.759  0.763  0.771  0.193  0.204  0.213  0.215 
DSH  0.644  0.742  0.770  0.799  0.712  0.731  0.740  0.748  0.696  0.717  0.715  0.722  0.173  0.187  0.191  0.202 
DHN  0.680  0.721  0.723  0.733  0.771  0.801  0.805  0.814  0.744  0.765  0.769  0.774  0.190  0.224  0.212  0.248 
COSDISH  0.583  0.661  0.680  0.701  0.642  0.740  0.784  0.796  0.689  0.692  0.731  0.758  0.187  0.235  0.256  0.275 
SDH  0.453  0.633  0.651  0.660  0.764  0.799  0.801  0.812  0.695  0.707  0.711  0.716  0.151  0.186  0.194  0.197 
FastH  0.597  0.663  0.684  0.702  0.726  0.769  0.781  0.803  0.719  0.747  0.754  0.760  0.173  0.206  0.216  0.244 
LFH  0.417  0.573  0.641  0.692  0.711  0.768  0.794  0.813  0.708  0.738  0.758  0.772  0.154  0.159  0.212  0.257 
ITQ  0.261  0.275  0.286  0.294  0.714  0.736  0.745  0.755  0.633  0.632  0.630  0.633  0.115  0.121  0.122  0.125 
Comparisons with the State-of-the-arts
Our method is compared with several state-of-the-art hashing methods from three types of learning manners. For the unsupervised learning manner, the compared method is iterative quantization (ITQ) Gong and Lazebnik (2011). For the non-deep supervised learning manner, the compared methods include column sampling based discrete supervised hashing (COSDISH) Kang et al. (2016), supervised discrete hashing (SDH) Shen et al. (2015), fast supervised hashing (FastH) Lin et al. (2014a), and latent factor hashing (LFH) Zhang et al. (2014). For the deep supervised learning manner, the compared methods consist of deep supervised discrete hashing (DSDH) Li et al. (2020), deep discrete supervised hashing (DDSH) Jiang et al. (2018), deep pairwise-supervised hashing (DPSH) Li et al. (2016), deep supervised hashing (DSH) Liu et al. (2016a), and deep hashing network (DHN) Zhu et al. (2016).
The comparison results of our method against the above state-of-the-art methods are tabulated in Table 2, from which we make three observations. First, the unsupervised method ITQ lags behind all the supervised methods, showing the advantage of supervised labels. Second, the performance of deep supervised hashing methods is generally better than that of the non-deep supervised hashing methods, which indicates that the features extracted by deep neural networks are better than handcrafted features. Third, our method achieves the highest retrieval accuracy on all datasets. For example, on the CIFAR-10 dataset, our method surpasses the state-of-the-art DDSH by 3.9% at 24 bits. On the NUS-WIDE dataset, we improve the best MAP values at all code lengths by at least 1.0%. On the MS-COCO dataset, we also get a 1.6% improvement at 12 bits. In particular, on the large-scale Clothing1M dataset, the MAP value is improved by 3.7%, 3.9%, 4.6%, and 5.5% at 12, 24, 32, and 48 bits, respectively. The compared state-of-the-art deep hashing methods adopt similar binary approximations that treat all hashing bits equally, such as DSDH using Tanh and DPSH using the regularization. Given this, we attribute the gains of our method over the competitors to the proposed uncertainty-aware learning manner, which applies different weights to different hashing bits and input images.
Conclusion
In this paper, we have proposed an uncertainty-aware deep supervised hashing method named DMUH. To begin with, we discover that the hashing network has different uncertainty about different approximate binary hashing bits. Based on this, we propose that hashing bits should receive different attention during training, rather than being treated equally. In addition, we introduce a momentum-updated network to assist in estimating such uncertainty, including both bit-level uncertainty and image-level uncertainty. The former is used to guide the regularization of hashing bits and the latter is used to assist in optimizing the Hamming distance. Extensive experiments on four datasets demonstrate the superiority of our proposed method, especially on the million-scale Clothing1M dataset.
References
Data uncertainty learning in face recognition. In CVPR.
Return of the devil in the details: delving deep into convolutional nets. In BMVC.
Deep supervised hashing with anchor graph. In ICCV.
Gaussian YOLOv3: an accurate and fast object detector using localization uncertainty for autonomous driving. In ICCV.
NUS-WIDE: a real-world web image database from National University of Singapore. In CIVR.
ExchNet: a unified hashing network for large-scale fine-grained image retrieval. In ECCV.
Self-ensembling for visual domain adaptation. In ICLR.
Neurons merging layer: towards progressive redundancy reduction for deep supervised hashing. In IJCAI.
Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In ICML.
Similarity search in high dimensions via hashing. In VLDB.
Iterative quantization: a procrustean approach to learning binary codes. In CVPR.
Momentum contrast for unsupervised visual representation learning. In CVPR.
Accelerate learning of deep hashing with gradient attention. In ICCV.
Deep discrete supervised hashing. TIP.
SVD: a large-scale short video dataset for near-duplicate video retrieval. In ICCV.
Asymmetric deep supervised hashing. In AAAI.
Column sampling based discrete supervised hashing. In AAAI.
Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. In BMVC.
What uncertainties do we need in Bayesian deep learning for computer vision? In NeurIPS.
Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto.
Simultaneous feature learning and hash coding with deep neural networks. In CVPR.
A general framework for deep supervised discrete hashing. IJCV.
Feature learning based deep supervised hashing with pairwise labels. In IJCAI.
Fast supervised hashing with decision trees for high-dimensional data. In CVPR.
Towards optimal discrete online hashing with balanced similarity. In AAAI.
Microsoft COCO: common objects in context. In ECCV.
Deep supervised hashing for fast image retrieval. In CVPR.
Towards optimal binary code learning via ordinal embedding. In AAAI.
Supervised hashing with kernels. In CVPR.
ImageNet large scale visual recognition challenge. IJCV.
Supervised discrete hashing. In CVPR.
Auto-encoding twin-bottleneck hashing. In CVPR.
Probabilistic face embeddings. In ICCV.
Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS.
A survey on learning to hash. TPAMI.
Sampling matters in deep embedding learning. In ICCV.
Unsupervised learning of probably symmetric deformable 3D objects from images in the wild. In CVPR.
Unsupervised feature learning via non-parametric instance discrimination. In CVPR.
Supervised hashing for image retrieval via image representation learning. In AAAI.
Learning from massive noisy labeled data for image classification. In CVPR.
DistillHash: unsupervised deep hashing by distilling data pairs. In CVPR.
Supervised hashing with latent factor models. In SIGIR.
Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. arXiv:2003.03773.
Deep hashing network for efficient similarity retrieval. In AAAI.
Locality-constrained deep supervised hashing for image retrieval. In IJCAI.