1 Introduction
The explosive growth of visual data (e.g., photos and videos) has led to renewed interest in efficient indexing and search algorithms [63, 51, 56, 33, 44, 62, 43, 75, 36, 47, 9, 61, 57, 6, 19, 70, 71, 72]. Among these, hashing-based approximate nearest neighbor (ANN) search, which balances retrieval quality and computational cost, has attracted increasing attention.
Generally, hashing methods can be divided into supervised and unsupervised models. Supervised hashing models [41, 35, 7, 67], which aim to learn hash functions with semantic labels, have shown remarkable performance. However, existing supervised hashing methods, especially deep hashing models, rely on massive amounts of labeled data to train. Thus, when there are not enough training examples, their performance may degrade dramatically due to overfitting to those training examples.
To address this challenge, unsupervised hashing methods usually adopt learning frameworks that require no supervised information. Traditional unsupervised hashing methods [3, 21, 40, 25] with handcrafted features cannot preserve the similarities of real-world data samples well, due to their low model capacity and the separation of the representation and binary code optimization processes. To take advantage of recent progress in deep learning [31, 60, 73], unsupervised deep hashing methods [30, 37, 11, 17, 15], which adopt neural networks as hash functions, have also been proposed. These deep hashing models are usually trained by minimizing either a quantization loss or a data reconstruction loss. However, since these objectives fail to exploit the semantic similarities between data points, they can hardly achieve satisfactory results.
In this paper, we propose a novel unsupervised deep hashing model, dubbed DistillHash, which addresses the absence of supervisory signals by distilling data pairs with confident semantic similarity relationships. In particular, we first exploit the local structure of the training data points to assign an initial similarity label to each data pair. If we treat the semantic similarity labels as true labels, these initial similarity labels contain label- and instance-dependent label noise, because many of them fail to represent semantic similarities. Assuming we know the probability of a semantic similarity label given a pair of data points, the Bayes optimal classifier assigns to the data pair the semantic similarity label with the higher probability (i.e., probability greater than 0.5). Based on these results, we give a rigorous analysis of the relationship between the noisy labels and the labels assigned by the Bayes optimal classifier. Inspired by the framework of
[8], we show that, under a mild assumption, data pairs with confident semantic labels can be distilled. Furthermore, we theoretically derive the criteria for selecting distilled data pairs and provide a simple but effective method to collect them automatically. Finally, given the distilled data pair set, we design a deep neural network and adopt a Bayesian learning framework to perform representation and hash code learning simultaneously. Our main contributions can be summarized as follows:

By treating signals learned from deep features as noisy pairwise labels, we successfully apply noisy-label learning techniques to our method. We show that data pairs whose labels are consistent with those assigned by the Bayes optimal classifier can be distilled.

We theoretically derive the criteria for selecting distilled data pairs for hash learning and further provide a simple but effective method to collect distilled data pairs automatically.

Experiments on three popular benchmark datasets show that our method outperforms current state-of-the-art unsupervised hashing methods.
2 Related Work
Recently, a considerable body of literature has grown up around the theme of hashing [45, 12, 32, 13, 46]. According to whether supervised information is involved in the learning phase, existing hashing models can be divided into two categories: supervised hashing methods and unsupervised hashing methods.
Supervised hashing methods [41, 22, 66, 35, 5, 52, 59, 14, 69] aim to learn hash functions that map data points to a Hamming space where semantic similarity is preserved. Kernel-based supervised hashing (KSH) [41] uses inner products to approximate the Hamming distance and learns hash functions by preserving semantic similarities in Hamming space. Fast supervised discrete hashing (FSDH) [22] uses a simple yet effective regression from the class labels of training data points to the corresponding hash codes to accelerate the learning process. Convolutional neural network-based hashing (CNNH) [66] decomposes hash function learning into two stages. First, a pairwise similarity matrix is constructed and decomposed into the product of approximate hash codes. Second, CNNH simultaneously learns representations and hash functions by training the model to predict the learned hash codes as well as the discrete image class labels. Deep Cauchy hashing (DCH) [5] adopts a Cauchy distribution to concentrate similar data pairs within a relatively small Hamming ball.
Unsupervised hashing methods [3, 40, 21, 25, 42] try to encode original data into binary codes by training on unlabeled data points. Iterative quantization (ITQ) [21] first uses principal component analysis (PCA) to map the data to a low-dimensional space and then exploits an alternating minimization scheme to find a rotation matrix that maps the data to binary codes with minimum quantization error. Discrete graph hashing (DGH) [40] casts the graph hashing problem into a discrete optimization framework and explicitly handles the discrete constraints, so it can directly output binary codes. Spherical hashing (SpH) [25] minimizes the spherical distance between the original real-valued features and the learned binary codes. Anchor graph hashing (AGH) [42] utilizes anchor graphs to obtain tractable low-rank adjacency matrices that approximate the data structure. Though traditional unsupervised hashing methods have made great progress, they usually depend on predefined features and cannot simultaneously optimize the feature and hash code learning processes, thus missing an opportunity to learn more effective hash codes.
Unsupervised deep hashing methods [55, 38, 30, 17, 37, 11, 68] adopt deep architectures to extract features and perform hash mapping. Semantic hashing [55] uses pretrained restricted Boltzmann machines (RBMs) [49] to construct an autoencoder network, which is then used to generate efficient hash codes and reconstruct the original inputs. Deep binary descriptors (DeepBit) [37] treats original images and their rotated versions as similar pairs and tries to learn hash codes that preserve this similarity. Stochastic generative hashing (SGH) [11] utilizes a generative mechanism to learn hash functions through the minimum description length principle; the hash codes are optimized to maximally compress the dataset as well as to regenerate the inputs. Semantic structure-based unsupervised deep hashing (SSDH) [68] takes advantage of the semantic information in deep features and learns semantic structures based on pairwise distances and a Gaussian estimation; the semantic structure is then used to guide the hash code learning process. By integrating the feature and hash code learning processes, deep unsupervised hashing methods usually produce better results.
Training classifiers from noisy labels is also a closely related task. Noisy labels refer to the setting where the labels of data points are corrupted [4, 74, 23, 24]. Since reliable labels are often expensive and difficult to obtain, a growing body of literature has been devoted to learning with noisy labels. These methods fall into two major groups: label noise-tolerant classification [2, 48] and label noise cleansing methods [50, 8, 39, 53]. The former adopts strategies such as decision trees or boosting-based ensemble techniques, while the latter tries to filter out label noise by exploiting prior information from training samples. For a comprehensive overview, we refer readers to [20]. By treating the initial similarity relationships as noisy labels, our method can explicitly model the relationship between noisy labels and the labels assigned by the Bayes optimal classifier, which enables us to extract data pairs with confident similarity signals.
3 Approach
Let $\mathcal{X} = \{x_i\}_{i=1}^{N}$ denote the training set with $N$ instances. Deep hashing aims to learn nonlinear hash functions $\mathcal{H}: x \mapsto b \in \{-1, 1\}^{K}$ that encode original data points into compact $K$-bit hash codes.
Traditional supervised deep hashing methods usually accept data pairs $(x_i, x_j, s_{ij})$ as inputs, where $s_{ij}$ is a binary label indicating whether $x_i$ and $x_j$ are similar. However, due to the laborious labeling process and the requisite domain knowledge, it is not feasible to directly acquire labels in many tasks. Thus, in this paper, we study the hashing problem under the unsupervised setting.
Inspired by Bayesian classifier theory [18], reliable labels for data pairs can be confidently assigned by a Bayes optimal classifier, i.e.,

$$s_{ij}^{*} = \begin{cases} 1, & \text{if } P(s_{ij} = 1 \mid x_i, x_j) \geq 0.5, \\ 0, & \text{otherwise}, \end{cases} \quad (1)$$

where $P(s_{ij} = 1 \mid x_i, x_j)$ denotes the probability that $x_i$ and $x_j$ are semantically similar. This Bayes optimal classifier implies that if we have access to $P(s_{ij} = 1 \mid x_i, x_j)$, we can infer the true labels with Eq. (1). However, under unsupervised settings, we cannot access $P(s_{ij} = 1 \mid x_i, x_j)$.
For unsupervised learning, some recent works [54, 34, 68] demonstrate that local structures learned from original features help capture the similarity relationships between points. Motivated by this, we can roughly label the training data pairs based on their local structures and construct a noisy similarity matrix $\hat{S} = (\hat{s}_{ij})$ as

$$\hat{s}_{ij} = \begin{cases} 1, & \text{if } d_{ij} \leq d_s, \\ 0, & \text{if } d_{ij} \geq d_d, \end{cases} \quad (2)$$

where $d_{ij}$ denotes the distance between the extracted features of $x_i$ and $x_j$, and $d_s$ and $d_d$ are thresholds on this distance. However, since $\hat{S}$ is constructed only from local structures, its labels are unreliable and may contain label noise.
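As a concrete illustration, the threshold rule above can be sketched as follows. This is a minimal sketch, not the authors' code: the function name, the use of cosine distance (the distance actually used in the experiments), the specific threshold values, and the convention of marking unlabeled pairs with -1 are all illustrative assumptions.

```python
import numpy as np

def noisy_similarity_matrix(features, d_low, d_high):
    """Assign rough pairwise labels from local structure, Eq. (2) style.

    s[i, j] = 1 (similar) when the cosine distance is at most d_low,
    s[i, j] = 0 (dissimilar) when it is at least d_high, and
    s[i, j] = -1 marks pairs left unlabeled between the two thresholds.
    """
    # Cosine distance: 1 - cosine similarity of L2-normalized features.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    dist = 1.0 - f @ f.T
    s = np.full(dist.shape, -1, dtype=int)   # -1: unlabeled
    s[dist <= d_low] = 1                     # confidently similar
    s[dist >= d_high] = 0                    # confidently dissimilar
    return s
```

For example, with features [1, 0], [0.9, 0.1], and [0, 1] and thresholds 0.05 and 0.5, the first two points are labeled similar and the third is labeled dissimilar to them; the resulting matrix is symmetric because cosine distance is.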
Note that, based on $\hat{S}$, we can learn an estimate of the conditional probability $P(\hat{s}_{ij} = 1 \mid x_i, x_j)$. Moreover, there exists a relationship between $P(\hat{s}_{ij} = 1 \mid x_i, x_j)$ and $P(s_{ij} = 1 \mid x_i, x_j)$:

$$P(\hat{s}_{ij} = 1 \mid x_i, x_j) = \left(1 - \rho_{ij}^{1}\right) P(s_{ij} = 1 \mid x_i, x_j) + \rho_{ij}^{0} \left(1 - P(s_{ij} = 1 \mid x_i, x_j)\right), \quad (3)$$

where $\rho_{ij}^{s} = P(\hat{s}_{ij} \neq s \mid s_{ij} = s, x_i, x_j)$ denotes the flip rate between the true label and the noisy label for the data pair $(x_i, x_j)$ with true label $s$. If we knew the values of $P(\hat{s}_{ij} = 1 \mid x_i, x_j)$ and the flip rates, the values of $P(s_{ij} = 1 \mid x_i, x_j)$ could easily be inferred. However, the flip rates are unknown. From Eq. (3) we can further see that, when the flip rates $\rho_{ij}^{1}$ and $\rho_{ij}^{0}$ are relatively small, if $P(\hat{s}_{ij} = 1 \mid x_i, x_j)$ is large, then $P(s_{ij} = 1 \mid x_i, x_j)$ should also be large, and vice versa. In the following subsection, we show that it is possible to infer whether $P(s_{ij} = 1 \mid x_i, x_j)$ is smaller or larger than $0.5$ based on some weak information about the flip rates (as many of the labels in $\hat{S}$ are correct, we show later that upper bounds for the flip rates can easily be obtained), which means we can potentially recover reliable labels for some data pairs. We call those data pairs whose reliable labels can be recovered from $\hat{S}$ distilled data pairs.
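The identity relating the noisy and clean posteriors is easy to sanity-check numerically. The helper below is a hypothetical illustration of that identity, not part of the paper's pipeline; `eta` stands for the clean posterior and `rho1`, `rho0` for the two flip rates.

```python
def noisy_posterior(eta, rho1, rho0):
    """P(noisy label = 1 | pair), given the clean posterior
    eta = P(s = 1 | pair) and the flip rates
    rho1 = P(s_hat = 0 | s = 1) and rho0 = P(s_hat = 1 | s = 0)."""
    return (1.0 - rho1) * eta + rho0 * (1.0 - eta)
```

With no noise the two posteriors coincide, and with small flip rates the noisy posterior increases with the clean one, which is exactly the monotonicity the argument above relies on.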
In the following subsection, we theoretically prove that distilled data pairs can be extracted under a mild assumption. And we further provide a method to collect distilled data pairs automatically.
3.1 Collecting Distilled Data Pairs Automatically
To collect distilled data pairs, we first give the following assumption.
Assumption 1.
For any data pair $(x_i, x_j)$, we have

$$\rho_{ij}^{1} + \rho_{ij}^{0} < 1, \quad (4)$$

where $\rho_{ij}^{1}$ and $\rho_{ij}^{0}$ are the flip rates of Eq. (3). This assumption implies that the label noise is not too heavy. Note that if the number of correctly labeled data pairs is assumed to be larger than the number of mislabeled data pairs, each flip rate is bounded by $0.5$. Assumption 1 is much weaker than that assumption.
It is hard to prove that the noisy labels constructed by exploiting local structures satisfy Assumption 1. However, experimental results on three widely used benchmark datasets empirically verify that the assumption holds well for the constructed noisy labels. In the rest of this paper, we assume that Assumption 1 holds.
We then extend the noisy-label learning techniques in [8] to pairwise labels and present the following key theorem, which gives the basic criteria for collecting distilled data pairs.
Theorem 1.
For any data pair $(x_i, x_j)$:

if $P(\hat{s}_{ij} = 1 \mid x_i, x_j) > \frac{1 + \rho_{ij}^{0}}{2}$, then $(x_i, x_j)$ is a distilled data pair with label $s_{ij} = 1$;

if $P(\hat{s}_{ij} = 1 \mid x_i, x_j) < \frac{1 - \rho_{ij}^{1}}{2}$, then $(x_i, x_j)$ is a distilled data pair with label $s_{ij} = 0$.
Proof. Rearranging Eq. (3) gives

$$P(\hat{s}_{ij} = 1 \mid x_i, x_j) = \rho_{ij}^{0} + \left(1 - \rho_{ij}^{1} - \rho_{ij}^{0}\right) P(s_{ij} = 1 \mid x_i, x_j). \quad (5)$$

Under Assumption 1, the coefficient $1 - \rho_{ij}^{1} - \rho_{ij}^{0}$ is positive, so the left-hand side is strictly increasing in $P(s_{ij} = 1 \mid x_i, x_j)$, and $P(s_{ij} = 1 \mid x_i, x_j) > 0.5$ holds if and only if $P(\hat{s}_{ij} = 1 \mid x_i, x_j) > \frac{1 + \rho_{ij}^{0} - \rho_{ij}^{1}}{2}$. Since $\frac{1 + \rho_{ij}^{0} - \rho_{ij}^{1}}{2} \leq \frac{1 + \rho_{ij}^{0}}{2}$, the first condition implies that the Bayes optimal classifier assigns label $1$, so $(x_i, x_j)$ is a distilled data pair with label $s_{ij} = 1$. The second case follows symmetrically. ∎
The trade-off for selecting distilled data pairs is the need to estimate the conditional probability $P(\hat{s}_{ij} = 1 \mid x_i, x_j)$ and the flip rates. To estimate $P(\hat{s}_{ij} = 1 \mid x_i, x_j)$, we adopt a probabilistic classification method. Specifically, we design a deep network that maps data pairs to probabilities. Since this objective is similar to hash code learning, we use the same architecture for both the probability estimation and hash code learning. A detailed description of this deep network is presented in the next subsection.
For the estimation of the flip rates, most existing works [39, 53] assume the noise to be label- and instance-independent, or instance-independent. In our method, however, the flip rates are label- and instance-dependent, so most existing methods are not suitable for the current problem. Considering the difficulty of directly estimating the flip rates, we instead propose a method to obtain upper bounds. Formally, we give the following proposition.
Proposition 1.
Given the conditional probability $P(\hat{s}_{ij} = 1 \mid x_i, x_j)$, the following inequalities hold:

$$\rho_{ij}^{0} \leq P(\hat{s}_{ij} = 1 \mid x_i, x_j), \qquad \rho_{ij}^{1} \leq P(\hat{s}_{ij} = 0 \mid x_i, x_j). \quad (6)$$
Proof.
However, if we directly combine Proposition 1 and Theorem 1, no distilled data pairs can be selected. We therefore further assume the flip rates to be locally invariant, and thus obtain flip rate upper bounds as

$$\bar{\rho}_{ij}^{0} = \min_{(x_k, x_l) \in \mathcal{N}_{\kappa}(x_i, x_j)} P(\hat{s}_{kl} = 1 \mid x_k, x_l), \qquad \bar{\rho}_{ij}^{1} = \min_{(x_k, x_l) \in \mathcal{N}_{\kappa}(x_i, x_j)} P(\hat{s}_{kl} = 0 \mid x_k, x_l), \quad (8)$$

where $\mathcal{N}_{\kappa}(x_i, x_j)$ indicates the set of top-$\kappa$ nearest neighbor pairs of $(x_i, x_j)$.

With the flip rate upper bounds $\bar{\rho}_{ij}^{0}$ and $\bar{\rho}_{ij}^{1}$, we have

$$P(\hat{s}_{ij} = 1 \mid x_i, x_j) > \frac{1 + \bar{\rho}_{ij}^{0}}{2} \;\Rightarrow\; s_{ij} = 1, \qquad P(\hat{s}_{ij} = 1 \mid x_i, x_j) < \frac{1 - \bar{\rho}_{ij}^{1}}{2} \;\Rightarrow\; s_{ij} = 0. \quad (9)$$

The conditional probability $P(\hat{s}_{ij} = 1 \mid x_i, x_j)$ can be estimated by the adopted deep network, and the flip rate upper bounds can be acquired with Eq. (8). Combining these results with Eq. (9) and Theorem 1, distilled data pairs can be collected by picking out every pair that satisfies the first condition in Eq. (9) and assigning it label $s_{ij} = 1$, and every pair that satisfies the second condition and assigning it label $s_{ij} = 0$. The distilled data pair set can be represented as $\mathcal{D} = \{(x_i, x_j, s_{ij})\}_{m=1}^{M}$, where $M$ is the number of distilled data pairs.
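The selection procedure above can be sketched in a few lines. Everything here is illustrative (the function name, the flat indexing of pairs, and the toy neighbor structure in the usage example are assumptions), and it presumes the noisy posteriors have already been estimated by the network.

```python
def collect_distilled_pairs(eta_hat, neighbors):
    """Pick distilled pairs from estimated noisy posteriors.

    eta_hat[i] is the estimate of P(s_hat = 1 | pair i), and neighbors[i]
    lists the indices of the nearest neighbor pairs of pair i (pair i
    itself may be included).

    Flip-rate upper bounds follow the local-invariance argument:
        rho0_bar = min over the neighborhood of eta_hat
        rho1_bar = min over the neighborhood of (1 - eta_hat)
    A pair is distilled positive if eta_hat > (1 + rho0_bar) / 2 and
    distilled negative if eta_hat < (1 - rho1_bar) / 2.
    """
    distilled = []
    for i, nbrs in enumerate(neighbors):
        rho0_bar = min(eta_hat[j] for j in nbrs)
        rho1_bar = min(1.0 - eta_hat[j] for j in nbrs)
        if eta_hat[i] > (1.0 + rho0_bar) / 2.0:
            distilled.append((i, 1))          # confident similar pair
        elif eta_hat[i] < (1.0 - rho1_bar) / 2.0:
            distilled.append((i, 0))          # confident dissimilar pair
    return distilled
```

For instance, with posteriors [0.95, 0.2, 0.1, 0.6] and a neighborhood that mixes high- and low-posterior pairs, the first pair is distilled positive and the second distilled negative; pairs whose posterior falls between the two data-dependent thresholds are simply left out.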
After obtaining the distilled data pair set, we can perform hash code learning, which is similar to the learning process for supervised hashing. Specifically, we adopt a Bayesian learning framework, which is elaborated in the following subsection.
3.2 Bayesian Learning Framework
In this subsection, we propose a Bayesian learning framework to perform deep hash learning and to estimate the conditional probability $P(\hat{s}_{ij} = 1 \mid x_i, x_j)$. We first introduce the framework for hash code learning and then show how to apply it to the estimation of $P(\hat{s}_{ij} = 1 \mid x_i, x_j)$.
By denoting the hash codes for the distilled data as $B = \{b_i\}$, the maximum likelihood (ML) estimation of the hash codes can be defined as:

$$\max_{B} \log P(\mathcal{D} \mid B) = \sum_{(x_i, x_j, s_{ij}) \in \mathcal{D}} \log P(s_{ij} \mid b_i, b_j), \quad (10)$$

where $P(s_{ij} \mid b_i, b_j)$ is the conditional probability of the similarity label $s_{ij}$ given the hash codes $b_i$ and $b_j$, which can naturally be approximated by a pairwise logistic function

$$P(s_{ij} \mid b_i, b_j) = \begin{cases} \sigma(\Theta_{ij}), & s_{ij} = 1, \\ 1 - \sigma(\Theta_{ij}), & s_{ij} = 0, \end{cases} \quad (11)$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function and $\Theta_{ij} = \frac{1}{2} b_i^{\top} b_j$ denotes the scaled inner product of the hash codes $b_i$ and $b_j$. Here, we adopt the inner product since, as indicated in [41], the Hamming distance of hash codes can be inferred from the inner product as $\mathrm{dist}_H(b_i, b_j) = \frac{1}{2}\left(K - b_i^{\top} b_j\right)$. Hence, the inner product can well reflect the Hamming distance of binary hash codes.

Similar to logistic regression, we can see that the smaller the Hamming distance $\mathrm{dist}_H(b_i, b_j)$ is, the larger the inner product $b_i^{\top} b_j$ and the conditional probability $P(s_{ij} = 1 \mid b_i, b_j)$ will be; conversely, the larger the Hamming distance, the larger $P(s_{ij} = 0 \mid b_i, b_j)$ will be. These results imply that similar data points are enforced to have small Hamming distances and dissimilar data points are enforced to have large Hamming distances, which is exactly what is expected for Hamming space similarity search. So, by learning with Eq. (10), effective hash codes can be obtained.

After training the model, given a data point, we can obtain its real-valued hash outputs $h$ by directly forward propagating the point through the adopted network, and obtain the final binary codes by the sign function

$$b = \mathrm{sgn}(h). \quad (12)$$
The whole learning algorithm is summarized in Algorithm 1.
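A minimal sketch of the per-pair objective and the binarization step follows. The 1/2 scaling of the inner product before the sigmoid is a common choice in pairwise hashing losses but is an assumption here, as is mapping a zero activation to +1 in the sign function; the function names are illustrative.

```python
import numpy as np

def binarize(h):
    """Eq. (12)-style binarization via the sign function (zeros map to +1)."""
    return np.where(np.asarray(h) >= 0, 1, -1)

def hamming_from_inner(b_i, b_j):
    """dist_H(b_i, b_j) = (K - <b_i, b_j>) / 2 for codes in {-1, +1}^K."""
    return 0.5 * (len(b_i) - float(np.dot(b_i, b_j)))

def neg_log_likelihood(b_i, b_j, s_ij, alpha=0.5):
    """One pairwise logistic term of the ML objective, Eqs. (10)-(11) style.

    alpha scales the inner product before the sigmoid; 1/2 is assumed here.
    """
    theta = alpha * float(np.dot(b_i, b_j))
    p = 1.0 / (1.0 + np.exp(-theta))          # sigmoid of the scaled inner product
    return -np.log(p if s_ij == 1 else 1.0 - p)
```

As expected from the identity above, codes differing in one bit out of four have Hamming distance 1, and a similar pair incurs a smaller loss when the codes agree more (larger inner product, larger sigmoid).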
Since this framework maps data pairs to similarity probabilities, we can also use it to estimate the conditional probability. The main difference is that for hash code learning we use the distilled data pairs as inputs, while for conditional probability estimation we use the data pairs constructed from the local structures.
4 Experiments
We evaluate our method on three popular benchmark datasets, FLICKR25K, NUS-WIDE, and CIFAR-10, and provide extensive evaluations to demonstrate its performance. In this section, we first introduce the datasets and then present our experimental results.
4.1 Datasets
FLICKR25K [26] contains 25,000 images collected from the Flickr website. Each image is manually annotated with at least one of the 24 unique provided labels. We randomly select 2,000 images as the test set; the remaining images serve as the retrieval set, from which we randomly select 5,000 images as the training set. NUS-WIDE [10] contains 269,648 images, each annotated with multiple labels referring to 81 concepts. The subset containing the 10 most popular concepts is used here. We randomly select 5,000 images as the test set; the remaining images serve as the retrieval set, and 10,500 images randomly selected from the retrieval set form the training set. CIFAR-10 [29] is a popular image dataset containing 60,000 images in 10 classes. For each class, we randomly select 1,000 images as queries and 500 as training images, resulting in a query set of 10,000 images and a training set of 5,000 images. All images except the query set are used as the retrieval set.
4.2 Baseline Methods
The proposed method is compared with six state-of-the-art traditional unsupervised hashing methods (LSH [3], SH [65], ITQ [21], PCAH [64], DSH [28], and SpH [25]) and three recently proposed deep unsupervised hashing methods (DeepBit [37], SGH [11], and SSDH [68]). All the codes for these methods have been kindly provided by the authors. LSH, SH, ITQ, PCAH, DSH, and SpH are implemented in MATLAB; SGH and SSDH are implemented with TensorFlow [1]; and DeepBit is implemented with Caffe [27]. Our own code is written with TensorFlow, and we run the algorithm on a machine with one Titan X Pascal GPU.
4.3 Evaluation
To evaluate the performance of our proposed method, we adopt three evaluation criteria: mean average precision (MAP), top-N precision, and precision-recall. The first two criteria are based on Hamming ranking, which ranks data points by their Hamming distances to the query; precision-recall, in contrast, is based on hash lookup. More detailed introductions are given as follows.
MAP is one of the most widely used criteria for evaluating retrieval accuracy. Given a query and a list of ranked retrieval results, the average precision (AP) for this query can be computed; MAP is then defined as the average of the APs over all queries. For all three datasets, the number of top returned results used to compute MAP is set to the size of the retrieval set. Top-N precision is defined as the average ratio of similar instances among the top N retrieved instances (in terms of Hamming distance) over all queries. In our experiments, N is set to 1,000. Precision-recall reveals the precision at different recall levels and is a good indicator of overall performance; typically, the area under the precision-recall curve is computed, with a larger value indicating better performance.
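The two Hamming-ranking metrics can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code; it follows the common convention in hashing evaluation of dividing AP by the number of relevant items actually retrieved, which is an implementation assumption.

```python
def average_precision(relevant, ranked):
    """AP for one query.

    `ranked` is the retrieval list (item ids, best first) and `relevant`
    is the set of ground-truth similar items.  Precision is accumulated
    at each rank where a hit occurs and averaged over the hits.
    """
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / max(hits, 1)

def top_n_precision(relevant, ranked, n):
    """Fraction of ground-truth similar items among the top-n retrieved."""
    return sum(1 for item in ranked[:n] if item in relevant) / n
```

For a query with relevant items {a, c} and ranking [a, b, c, d], the hits occur at ranks 1 and 3, giving AP = (1/1 + 2/3) / 2 = 5/6 and top-2 precision 0.5; MAP then averages the APs over all queries.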
4.4 Implementation Details
To initialize the noisy similarity matrix in Eq. (2), we use the cosine distance to measure the local structure of the training examples. The thresholds $d_s$ and $d_d$ are selected as indicated in [68]. For the adopted deep network, we use the VGG16 architecture [58] and replace its last fully-connected layer with a new fully-connected layer with $K$ units for hash code learning. For the estimation of the conditional probability $P(\hat{s}_{ij} = 1 \mid x_i, x_j)$, the dimension of the last fully-connected layer is set accordingly. To obtain the upper bounds of the flip rates, the neighborhood size $\kappa$ in Eq. (8) is set empirically; the parameter sensitivity of our algorithm with regard to these hyperparameters is analyzed in Subsection 4.6. Parameters of the new fully-connected layer are learned from scratch, while parameters of the preceding layers are fine-tuned from the model pretrained on ImageNet [16]. We employ the standard stochastic gradient descent algorithm with 0.9 momentum for optimization; the mini-batch size is set to 64, and the learning rate is fixed.
Two data points are considered neighbors if they share the same label (for CIFAR-10) or at least one common label (for the multi-label datasets FLICKR25K and NUS-WIDE). For a fair comparison, we adopt the deep features extracted from the last fully-connected layer of the VGG16 network pretrained on ImageNet for all shallow-architecture baseline methods. These deep features are also used for the construction of the noisy similarity matrix $\hat{S}$. Since VGG16 accepts images of size 224 × 224 as inputs, we resize all images to 224 × 224 before feeding them into the network. Random rotation and flipping are also used for data augmentation.

Table 1: MAP results for different numbers of bits on the FLICKR25K, NUS-WIDE, and CIFAR-10 datasets.

method  FLICKR25K  NUS-WIDE  CIFAR-10
16 bits  32 bits  64 bits  128 bits  16 bits  32 bits  64 bits  128 bits  16 bits  32 bits  64 bits  128 bits  
LSH [3]  0.5831  0.5885  0.5933  0.6014  0.4324  0.4411  0.4433  0.4816  0.1319  0.1580  0.1673  0.1794 
SH [65]  0.5919  0.5923  0.6016  0.6213  0.4458  0.4537  0.4926  0.5000  0.1605  0.1583  0.1509  0.1538 
ITQ [21]  0.6192  0.6318  0.6346  0.6477  0.5283  0.5323  0.5319  0.5424  0.1942  0.2086  0.2151  0.2188 
PCAH [64]  0.6091  0.6105  0.6033  0.6071  0.4625  0.4531  0.4635  0.4923  0.1432  0.1589  0.1730  0.1835 
DSH [28]  0.6074  0.6121  0.6118  0.6154  0.5200  0.5227  0.5345  0.5370  0.1616  0.1876  0.1918  0.2055 
SpH [25]  0.6108  0.6029  0.6339  0.6251  0.4532  0.4597  0.4958  0.5127  0.1439  0.1665  0.1783  0.1840 
DeepBit [37]  0.5934  0.5933  0.6199  0.6349  0.4542  0.4625  0.4761  0.4923  0.2204  0.2410  0.2521  0.2530 
SGH [11]  0.6162  0.6283  0.6253  0.6206  0.4936  0.4829  0.4865  0.4975  0.1795  0.1827  0.1889  0.1904 
SSDH [68]  0.6621  0.6733  0.6732  0.6771  0.6231  0.6294  0.6321  0.6485  0.2568  0.2560  0.2587  0.2601 
DistillHash  0.6964  0.7056  0.7075  0.6995  0.6667  0.6752  0.6769  0.6747  0.2844  0.2853  0.2867  0.2895 
4.5 Results and Discussion
We first present the MAP values for all methods with different hash bit lengths, then draw precision-recall and top-N precision curves for all methods with 16- and 32-bit hash codes to give a more comprehensive comparison.
Table 1 presents the MAP results for DistillHash and all baseline methods on FLICKR25K, NUS-WIDE, and CIFAR-10, with hash code lengths varying from 16 to 128 bits. Comparing the data-independent method LSH with the data-dependent methods, we see that data-dependent hashing methods outperform the data-independent method in most cases. This may be because data-dependent methods learn hash functions from the data and can thus better capture the underlying data structure. Comparing deep and non-deep hashing methods, we find that non-deep methods can surpass the deep methods DeepBit and SGH in some cases. This may be because, without proper supervisory signals, deep hashing methods cannot fully exploit the representational ability of deep networks and may achieve unsatisfactory performance by overfitting to bad local minima. In contrast, by exploiting local structures, the deep hashing methods SSDH and DistillHash achieve more promising results.
Concretely, from the MAP results, we can see that DistillHash consistently obtains the best results across different hash bit lengths on all three datasets. Specifically, compared to one of the best non-deep hashing methods, i.e., ITQ, we achieve absolute improvements of roughly 0.069, 0.140, and 0.077 in average MAP over the different bit lengths on FLICKR25K, NUS-WIDE, and CIFAR-10, respectively. Compared to the state-of-the-art deep hashing method SSDH, we achieve absolute improvements of roughly 0.031, 0.040, and 0.029 in average MAP on the three datasets, respectively. Note that although DeepBit, SGH, SSDH, and DistillHash are all deep hashing methods, only SSDH and DistillHash exploit and preserve the similarities between data points, so they achieve better performance than the other two. Moreover, DistillHash learns more accurate similarity relationships by distilling data pairs, and thus obtains a further performance improvement over SSDH.
The left two subfigures of Figures 1, 2, and 3 present the top-N precision curves for all methods on each of the three datasets with hash bit lengths of 16 and 32. Consistent with the MAP results, we observe that DistillHash achieves the best results among all approaches.
Table 2: MAP results of DistillHash and its variant DistillHash* on the three datasets.

method  FLICKR25K  NUS-WIDE  CIFAR-10
16 bits  32 bits  64 bits  128 bits  16 bits  32 bits  64 bits  128 bits  16 bits  32 bits  64 bits  128 bits  
DistillHash*  0.6653  0.6633  0.6726  0.6784  0.6322  0.6357  0.6480  0.6451  0.2547  0.2538  0.2573  0.2583 
DistillHash  0.6964  0.7056  0.7075  0.6995  0.6667  0.6752  0.6769  0.6747  0.2844  0.2853  0.2867  0.2895 
Since MAP values and top-N precision curves are both Hamming ranking-based metrics, the above analysis shows that DistillHash achieves superior performance under Hamming ranking-based evaluation. Moreover, to illustrate the hash lookup results, we plot the precision-recall curves for all methods with hash bit lengths of 16 and 32 in the right two subfigures of Figures 1, 2, and 3. From the results, we again observe that DistillHash consistently achieves the best performance, which further demonstrates the superiority of our proposed method.
To investigate how the loss values change during training, we display the loss curves in Figure 4. The results show that our method converges within 1,000 iterations in all cases.
4.6 Parameter Sensitivity
We next investigate the sensitivity of the two hyperparameters. Figure 5 shows their effect on the NUS-WIDE dataset with hash code lengths of 16, 32, 64, and 128. We first fix one hyperparameter and evaluate the MAP while varying the other; the results are presented in Figure 5(a). The performance shows that the algorithm is not sensitive to this parameter over a wide range, and any value within that range can be used; we fix one such value in our experiments. Figure 5(b) shows the MAP when varying the second hyperparameter with the first fixed. The performance of DistillHash first increases and then stays at a relatively high level, and the result is likewise insensitive over a wide range; we fix its value accordingly for the other experiments in this paper.
4.7 Ablation Study
In this subsection, we study the efficacy of the proposed distilled data pair learning in more depth. Specifically, we investigate DistillHash*, a variant of DistillHash with the same Bayesian learning framework but trained with the initial similarity labels. The MAP results of DistillHash* and DistillHash are shown in Table 2, from which we can see that DistillHash consistently outperforms DistillHash*, by margins of 0.0311, 0.0423, 0.0349, and 0.0211 on the FLICKR25K dataset, 0.0345, 0.0395, 0.0289, and 0.0296 on the NUS-WIDE dataset, and 0.0297, 0.0315, 0.0294, and 0.0312 on the CIFAR-10 dataset at hash bit lengths of 16, 32, 64, and 128, respectively. Note that the only difference between DistillHash and DistillHash* is that DistillHash is trained with the distilled data set while DistillHash* is trained with the initial data set. The performance improvements clearly demonstrate the efficacy of the proposed distilled data pair learning.
5 Conclusions
This work presented a new unsupervised deep hashing approach for image search, namely DistillHash. First, we theoretically investigated the relationship between the Bayes optimal classifier and noisy labels learned from local structures, showing that distilled data pairs can be collected. Second, with this understanding, we provided a simple yet effective scheme to automatically distill data pairs. Third, leveraging the distilled data set, we designed a deep hashing model and adopted a Bayesian learning framework to perform hash code learning. Experimental results on three benchmark datasets demonstrated that the proposed DistillHash surpasses other state-of-the-art methods.
6 Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 61572388 and 61703327, in part by the Key R&D Program (Key Industry Innovation Chain of Shaanxi) under Grants 2017ZDCXLGY050402, 2017ZDCXLGY0502, and 2018ZDXMGY176, in part by the National Key R&D Program of China under Grant 2017YFE0104100, and in part by the Australian Research Council Projects DP180103424, DE1901014738, and FL170100117.
References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] Kamal M Ali and Michael J Pazzani. Error reduction through learning multiple descriptions. Machine Learning, 24(3):173–202, 1996.
[3] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, pages 459–468, 2006.
 [4] Battista Biggio, Blaine Nelson, and Pavel Laskov. Support vector machines under adversarial label noise. In Asian Conference on Machine Learning, pages 97–112, 2011.
[5] Yue Cao, Mingsheng Long, Bin Liu, and Jianmin Wang. Deep Cauchy hashing for Hamming space retrieval. In CVPR, pages 1229–1237, 2018.
[6] Xinyuan Chen, Chang Xu, Xiaokang Yang, and Dacheng Tao. Attention-GAN for object transfiguration in wild images. In ECCV, pages 164–180, 2018.
 [7] Zhixiang Chen, Xin Yuan, Jiwen Lu, Qi Tian, and Jie Zhou. Deep hashing via discrepancy minimization. In CVPR, June 2018.
[8] Jiacheng Cheng, Tongliang Liu, Kotagiri Ramamohanarao, and Dacheng Tao. Learning with bounded instance- and label-dependent label noise. arXiv preprint arXiv:1709.03768, 2017.
[9] Lianhua Chi, Bin Li, Xingquan Zhu, Shirui Pan, and Ling Chen. Hashing for adaptive real-time graph stream classification with concept drifts. IEEE Trans. Cybern., 48(5):1591–1604, 2018.
[10] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In CIVR, page 48, 2009.
 [11] Bo Dai, Ruiqi Guo, Sanjiv Kumar, Niao He, and Le Song. Stochastic generative hashing. In ICML, 2017.
[12] Cheng Deng, Zhaojia Chen, Xianglong Liu, Xinbo Gao, and Dacheng Tao. Triplet-based deep hashing network for cross-modal retrieval. IEEE Trans. Image Process., 27(8):3893–3903, 2018.
[13] Cheng Deng, Huiru Deng, Xianglong Liu, and Yuan Yuan. Adaptive multi-bit quantization for hashing. Neurocomputing, 151:319–326, 2015.
[14] Cheng Deng, Xu Tang, Junchi Yan, Wei Liu, and Xinbo Gao. Discriminative dictionary learning with common label alignment for cross-modal retrieval. IEEE Trans. Multimedia, 18(2):208–218, 2016.
[15] Cheng Deng, Erkun Yang, Tongliang Liu, Wei Liu, Jie Li, and Dacheng Tao. Unsupervised semantic-preserving adversarial hashing for image search. IEEE Trans. Image Process., 2019.
[16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
 [17] Kamran Ghasedi Dizaji, Feng Zheng, Najmeh Sadoughi Nourabadi, Yanhua Yang, Cheng Deng, and Heng Huang. Unsupervised deep generative adversarial hashing network. In CVPR, 2018.
[18] Pedro Domingos and Michael Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103–130, 1997.
[19] Yuxuan Du, Min-Hsiu Hsieh, Tongliang Liu, and Dacheng Tao. The expressive power of parameterized quantum circuits. arXiv preprint arXiv:1810.11922, 2018.
 [20] Benoît Frénay and Michel Verleysen. Classification in the presence of label noise: a survey. IEEE Trans. Neural Netw. Learn. Syst., 25(5):845–869, 2014.

[21] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell., 35(12):2916–2929, 2013.
[22] Jie Gui, Tongliang Liu, Zhenan Sun, Dacheng Tao, and Tieniu Tan. Fast supervised discrete hashing. IEEE Trans. Pattern Anal. Mach. Intell., 40(2):490–496, 2018.
 [23] Bo Han, Jiangchao Yao, Gang Niu, Mingyuan Zhou, Ivor Tsang, Ya Zhang, and Masashi Sugiyama. Masking: A new perspective of noisy supervision. In NIPS, pages 5841–5851, 2018.
 [24] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NIPS, pages 8536–8546, 2018.
 [25] Jae-Pil Heo, Youngwoon Lee, Junfeng He, Shih-Fu Chang, and Sung-Eui Yoon. Spherical hashing. In CVPR, pages 2957–2964, 2012.
 [26] Mark J. Huiskes and Michael S. Lew. The MIR Flickr retrieval evaluation. In ICMR, 2008.
 [27] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678, 2014.
 [28] Zhongming Jin, Cheng Li, Yue Lin, and Deng Cai. Density sensitive hashing. IEEE Trans. Cybern., 44(8):1362–1371, 2014.
 [29] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
 [30] Alex Krizhevsky and Geoffrey E Hinton. Using very deep autoencoders for content-based image retrieval. In ESANN, 2011.
 [31] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
 [32] Chao Li, Cheng Deng, Ning Li, Wei Liu, Xinbo Gao, and Dacheng Tao. Self-supervised adversarial hashing networks for cross-modal retrieval. In CVPR, pages 4242–4251, 2018.
 [33] Chao Li, Cheng Deng, Lei Wang, De Xie, and Xianglong Liu. Coupled CycleGAN: Unsupervised hashing network for cross-modal retrieval. arXiv preprint arXiv:1903.02149, 2019.
 [34] Siyang Li, Bryan Seybold, Alexey Vorobyov, Alireza Fathi, Qin Huang, and C.-C. Jay Kuo. Instance embedding transfer to unsupervised video object segmentation. In CVPR, pages 6526–6535, 2018.
 [35] Wu-Jun Li, Sheng Wang, and Wang-Cheng Kang. Feature learning based deep supervised hashing with pairwise labels. In IJCAI, pages 1711–1717, 2016.
 [36] Yeqing Li, Wei Liu, and Junzhou Huang. Subselective quantization for learning binary codes in largescale image search. IEEE Trans. Pattern Anal. Mach. Intell., 40(6):1526–1532, 2018.
 [37] Kevin Lin, Jiwen Lu, Chu-Song Chen, and Jie Zhou. Learning compact binary descriptors with unsupervised deep neural networks. In CVPR, pages 1183–1192, 2016.
 [38] Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, Jie Zhou, et al. Deep hashing for compact binary codes learning. In CVPR, volume 1, page 3, 2015.
 [39] Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Trans. Pattern Anal. Mach. Intell., 38(3):447–461, 2016.
 [40] Wei Liu, Cun Mu, Sanjiv Kumar, and ShihFu Chang. Discrete graph hashing. In NIPS, pages 3419–3427, 2014.
 [41] W. Liu, J. Wang, R. Ji, Y. Jiang, and S.F. Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081, 2012.
 [42] Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. Hashing with graphs. In ICML, pages 1–8, 2011.
 [43] Wei Liu, Jun Wang, Yadong Mu, Sanjiv Kumar, and Shih-Fu Chang. Compact hyperplane hashing with bilinear functions. In ICML, 2012.
 [44] Wei Liu and Tongtao Zhang. Multimedia hashing and networking. IEEE MultiMedia, 23(3):75–79, 2016.
 [45] Xianglong Liu, Cheng Deng, Bo Lang, Dacheng Tao, and Xuelong Li. Query-adaptive reciprocal hash tables for nearest neighbor search. IEEE Trans. Image Process., 25(2):907–919, 2016.
 [46] Xianglong Liu, Bowen Du, Cheng Deng, Ming Liu, and Bo Lang. Structure sensitive hashing with adaptive product quantization. IEEE Trans. Cybern., 46(10):2252–2264, 2016.
 [47] Yu Liu, Jingkuan Song, Ke Zhou, Lingyu Yan, Li Liu, Fuhao Zou, and Ling Shao. Deep self-taught hashing for image retrieval. IEEE Trans. Cybern., 2018.
 [48] Prem Melville, Nishit Shah, Lilyana Mihalkova, and Raymond J Mooney. Experiments on ensembles with missing and noisy data. In MCS Workshop, pages 293–302, 2004.
 [49] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, pages 807–814, 2010.
 [50] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NIPS, pages 1196–1204. 2013.
 [51] Wing WY Ng, Xing Tian, Witold Pedrycz, Xizhao Wang, and Daniel S Yeung. Incremental hash-bit learning for semantic image retrieval in non-stationary environments. IEEE Trans. Cybern., (99), 2018.
 [52] M. Norouzi and D. M Blei. Minimal loss hashing for compact binary codes. In ICML, pages 353–360, 2011.
 [53] Curtis G Northcutt, Tailin Wu, and Isaac L Chuang. Learning with confident examples: Rank pruning for robust classification with noisy labels. In UAI, 2017.
 [54] Pedro O Pinheiro and AI Element. Unsupervised domain adaptation with similarity learning. In CVPR, pages 8004–8013, 2018.
 [55] Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. Int. J. Approx. Reasoning., 50(7):969–978, 2009.
 [56] Fumin Shen, Chunhua Shen, Wei Liu, and Heng Tao Shen. Supervised discrete hashing. In CVPR, volume 2, page 5, 2015.
 [57] Yuming Shen, Li Liu, Fumin Shen, and Ling Shao. Zero-shot sketch-image hashing. In CVPR, pages 3598–3607, 2018.
 [58] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [59] Jingkuan Song, Yi Yang, Xuelong Li, Zi Huang, and Yang Yang. Robust hashing with local models for approximate similarity search. IEEE Trans. Cybern., 44(7):1225–1236, 2014.
 [60] Chaoyue Wang, Chang Xu, Xin Yao, and Dacheng Tao. Evolutionary generative adversarial networks. IEEE Trans. Evol. Comput., 2019.
 [61] Hao Wang, Yanhua Yang, Erkun Yang, and Cheng Deng. Exploring hybrid spatiotemporal convolutional networks for human action recognition. Multimedia Tools and Applications, 76(13):15065–15081, 2017.
 [62] Jun Wang, Wei Liu, Sanjiv Kumar, and Shih-Fu Chang. Learning to hash for indexing big data – a survey. Proceedings of the IEEE, 104(1):34–57, 2016.
 [63] Shengnan Wang, Chunguang Li, and HuiLiang Shen. Distributed graph hashing. IEEE Trans. Cybern., (99):1–13, 2018.
 [64] Xin-Jing Wang, Lei Zhang, Feng Jing, and Wei-Ying Ma. AnnoSearch: Image auto-annotation by search. In CVPR, volume 2, pages 1483–1490, 2006.
 [65] Yair Weiss, Antonio Torralba, and Rob Fergus. Spectral hashing. In NIPS, pages 1753–1760, 2009.
 [66] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, volume 1, page 2, 2014.
 [67] Erkun Yang, Cheng Deng, Chao Li, Wei Liu, Jie Li, and Dacheng Tao. Shared predictive crossmodal deep quantization. IEEE Trans. Neural Netw. Learn. Syst., 2018.
 [68] Erkun Yang, Cheng Deng, Tongliang Liu, Wei Liu, and Dacheng Tao. Semantic structurebased unsupervised deep hashing. In IJCAI, pages 1064–1070, 2018.
 [69] Erkun Yang, Cheng Deng, Wei Liu, Xianglong Liu, Dacheng Tao, and Xinbo Gao. Pairwise relationship guided deep hashing for cross-modal retrieval. In AAAI, pages 1618–1625, 2017.
 [70] Erkun Yang, Tongliang Liu, Cheng Deng, and Dacheng Tao. Adversarial examples for hamming space search. IEEE Trans. Cybern., 2018.
 [71] Xu Yang, Cheng Deng, Xianglong Liu, and Feiping Nie. New l2,1-norm relaxation of multi-way graph cut for clustering. In AAAI, pages 4374–4381, 2018.
 [72] Shan You, Chang Xu, Yunhe Wang, Chao Xu, and Dacheng Tao. Privileged multi-label learning. In IJCAI, pages 3336–3342, 2017.
 [73] Shan You, Chang Xu, Chao Xu, and Dacheng Tao. Learning from multiple teacher networks. In ACM SIGKDD, pages 1285–1294. ACM, 2017.
 [74] Xiyu Yu, Tongliang Liu, Mingming Gong, Kun Zhang, and Dacheng Tao. Transfer learning with label noise. arXiv preprint arXiv:1707.09724, 2017.
 [75] Hanwang Zhang, Fumin Shen, Wei Liu, Xiangnan He, Huanbo Luan, and TatSeng Chua. Discrete collaborative filtering. In SIGIR, pages 325–334. ACM, 2016.