1 Introduction
Approximate nearest neighbour search with binary representations has been regarded as an effective and efficient solution to large-scale multimedia data retrieval. Conventionally termed learning to hash, this family of techniques aims at (a) shrinking the embedding size of data and (b) producing binary features to speed up the computation of distance-based pairwise data relevance. Similar to many other machine learning tasks, learning to hash can be either unsupervised or supervised. The former requires less labeling effort for training, while the latter obtains better retrieval performance. We focus on supervised hashing to fully leverage the semantic information of data.
Recent research in this field has largely boosted the performance of the produced hash codes by introducing deep learning techniques. Deep hashing models typically apply a non-differentiable sign activation on top of the encoding network. Various methods have been proposed to empower the encoder with the ability to properly locate data in the Hamming space. A typical approach is to employ a held-out code learner as a complement to network training [11, 29, 40]. The code learner performs discrete optimization and alternately updates the semantic-based target codes to govern the behavior of the encoding network. This approach generally requires longer training time, since the held-out discrete optimization step cannot be effectively parallelized, and consumes additional memory to cache the target codes during each round of updates. Alternatively, some propose to decouple unrelated data representations by introducing similarity-based penalties to the encoders [7, 42, 43, 44]. To train an encoder with these regularizers, one may resort to continuous relaxation of the codes, which arguably degrades the training quality. One recent trend in deep hashing is to employ generative adversarial models [5, 13, 34, 45]. By distinguishing synthesized data from real data, the encoder implicitly learns the respective data distribution.
However, the above elaborately designed approaches raise another question: how can one build an effective supervised hashing model with minimal auxiliary components?
We attempt to find the answer by carefully considering the following main challenges of learning to hash:

Keeping the discrete nature of binary codes. The binary constraints usually lead to an NP-hard optimization problem in parameterized models, and cannot be directly solved by gradient-based methods. This is usually addressed by conventional methods using held-out discrete optimization or relaxation techniques.

Enriching the information carried by the codes. It is always essential to make the encoder aware of the semantic information (e.g., labels or tags) of data.
As a result, in this paper, we propose a simple but powerful deep hashing network. In our model, the above problems are tackled by relating data and their semantics through a binary representation bottleneck, which is thereafter used as the final hash codes. A single recognition penalty is applied for training. With a reasonable regularization term, the final learning objective forms a variational lower bound of the Information Bottleneck (IB) [2, 36] between observed data and their semantics. Importantly, one can impose stochasticity on the binary bottleneck to keep the binary constraints and apply gradient estimation methods during training. Therefore, the whole framework can be optimized end-to-end with Stochastic Gradient Descent (SGD). We find that our design leads to an embarrassingly simple solution, which basically shapes a single classification neural network.
Regardless of the regularization, the proposed model just maximizes the label likelihood of data. Thus, we name our model Just-Maximizing-Likelihood Hashing (JMLH). The contributions of this paper are summarized as follows:

We propose a simple and novel deep hashing model, i.e., JMLH, and theoretically base it on the Variational Information Bottleneck (VIB) [2] method. To the best of our knowledge, JMLH is the first attempt in deep hashing to employ the IB methods.

We show that, when properly designed and trained, a classification neural network with a discrete bottleneck already produces effective binary representations. Therefore, the proposed model requires no auxiliary components and can be optimized directly.

Relations between JMLH and many existing hashing models are discussed in detail.
2 Model
The goal of learning to hash is to find an optimal encoding function $f: X \rightarrow B$ to represent data. Here $X$ is the variable space of data observation and $m$ refers to the length of the hash code space $B = \{0,1\}^m$. In the context of supervised hashing, training is usually supported by the data labels $Y$. We intentionally use capitalized notations, $X$, $B$, and $Y$, for the (random) variable spaces, and denote the respective variable instances with lowercase letters, $x$, $b$, and $y$.
2.1 JMLH at a Glance
JMLH involves a stochastic encoder $q_\phi(b|x)$ and a classifier $q_\theta(y|b)$. An additional deterministic distribution $p(b)$ is used as the prior of $B$. (Here we use $q(\cdot)$ to denote an approximated posterior when one cannot directly model the corresponding true distribution, e.g., $p(b|x)$. On the other hand, $p(\cdot)$ is used when the distribution can be deterministically defined or computed, e.g., the predefined prior $p(b)$.) This model is illustrated in Figure 1 as a directed graphical model. Particularly, each datum $x$ is first associated with a latent binary code $b$ according to $q_\phi(b|x)$, and then the respective label $y$ can be predicted by feeding $q_\theta(y|b)$ with $b$. Therefore, $B$ can be regarded as the bottleneck between $X$ and $Y$. Successively applying $q_\phi(b|x)$ and $q_\theta(y|b)$ according to the above procedure specifies a single-task neural network with a binary layer in between, which makes JMLH extremely simple.
We first describe the above-mentioned probabilistic models and then discuss how they are combined as a whole for efficient end-to-end training.
2.1.1 Parameterizing the Probability Models
Given a training pair $(x, y)$, the corresponding probability models $q_\phi(b|x)$ and $q_\theta(y|b)$ in JMLH are defined as
(1)
Here the encoder distribution is the Poisson binomial distribution, parameterized by a neural network $f_\phi$ as follows:
(2)
On the other hand, $q_\theta(y|b)$ can be either categorical for single-label classification or Poisson binomial for multi-label classification, implemented by another network $g_\theta$. We additionally introduce a binomial distribution $p(b)$ as the code prior for regularization purposes.
Note that we choose discrete probability models for $q_\phi(b|x)$ to avoid the use of continuous relaxation. That is to say, the input to the classifier $q_\theta(y|b)$ is already binarized. Continuous relaxation, e.g., activating the neurons with a sigmoid nonlinearity, is not considered here as it skews the observation of the classifier, propagating a biased measurement of semantic information back to the encoder.
2.1.2 Shaping a Single Network
Sequentially stacking $q_\phi(b|x)$ and $q_\theta(y|b)$ empirically forms a classification neural network with a binary bottleneck $b$, of which the brief structure is illustrated in Table 1. It can be seen that JMLH only introduces two additional layers on top of an arbitrary network backbone, which makes it easy to adapt to different pretrained models and convenient to implement.
Then we define the learning objective of this single network, given the training pairs, as
(3) 
where $\lambda$ is a hyperparameter. All the probability models are defined in Eq. (1). We first elaborate on each component in this subsection and later show in Section 2.2.1 that this learning objective is supported by VIB [2].
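As a compact illustration of the objective described in this subsection (a classification negative log-likelihood plus a KL regularizer weighted by the hyperparameter), one plausible per-sample form is sketched below. The symbols $x$, $y$, $b$, $q_\phi$, $q_\theta$, $p$ and $\lambda$ are notation assumed for this sketch, and whether the terms are summed or averaged over the training pairs is likewise an assumption.

```latex
\mathcal{L}(x, y) \;=\;
  \mathbb{E}_{q_\phi(b \mid x)}\big[-\log q_\theta(y \mid b)\big]
  \;+\; \lambda\, \mathrm{KL}\big(q_\phi(b \mid x)\,\|\,p(b)\big)
```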
Notation  Specification  Variable
Input  Arbitrary data, images in our experiments  $x$
$f_\phi$  Arbitrary network backbone, AlexNet [21] before fc_7 in our experiments
$f_\phi$  Fully-connected, size of $m$
$f_\phi$  Binary stochastic activation  $b$
$g_\theta$  Fully-connected, size of label length
$g_\theta$  softmax (single-label datasets) or sigmoid (multi-label datasets)  $y$
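The layer stack listed in Table 1 can be sketched as a plain forward pass. This is a minimal illustration in pure Python, not the authors' TensorFlow code: the toy sizes, the random weights, and the helper names (`linear`, `forward`) are all assumptions, and the backbone is replaced by a fixed feature vector.

```python
import math
import random

def linear(x, weights, bias):
    """Fully-connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(features, enc_w, enc_b, cls_w, cls_b, rng):
    """Backbone features -> FC + sigmoid -> stochastic binary code -> FC + softmax."""
    probs = [sigmoid(z) for z in linear(features, enc_w, enc_b)]  # per-bit probabilities
    code = [1 if p >= rng.random() else 0 for p in probs]         # binary bottleneck
    label_probs = softmax(linear([float(c) for c in code], cls_w, cls_b))
    return code, label_probs

rng = random.Random(0)
d, m, n_classes = 4, 8, 3  # toy sizes, chosen only for illustration
enc_w = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
enc_b = [0.0] * m
cls_w = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(n_classes)]
cls_b = [0.0] * n_classes
code, label_probs = forward([0.5, -1.0, 2.0, 0.1], enc_w, enc_b, cls_w, cls_b, rng)
```

The classifier only ever sees the sampled 0/1 code, which is the property the text emphasizes when it rules out continuous relaxation.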
The first right-hand-side (RHS) term of Eq. (3), $\mathbb{E}_{q_\phi(b|x)}[-\log q_\theta(y|b)]$, is actually a negative log-likelihood classification penalty since $q_\theta(y|b)$ is categorical. This loss conveys semantic label information of data to their codes during training.
The second RHS term of Eq. (3) acts as a regularizer. By minimizing the Kullback-Leibler (KL) divergence between the posterior $q_\phi(b|x)$ and the prior $p(b)$, the entropy carried by $b$ is preserved. As the prior and the posterior are basically binomial, the KL divergence can be deterministically computed by two entropy terms:
(4) 
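To make the closed-form computation concrete, the sketch below evaluates the KL divergence between a factorized Bernoulli posterior and a prior that puts probability 0.5 on every bit. The uniform prior is an assumption made for this example (it matches the "equally distributed bits" mentioned in Section 4.2.2), and the function names are hypothetical. Under that assumption the KL term equals the prior cross-entropy ($m \log 2$) minus the posterior entropy, so minimizing it pushes the code entropy up.

```python
import math

def bernoulli_entropy(probs, eps=1e-12):
    """Entropy (in nats) of independent Bernoulli bits with the given probabilities."""
    total = 0.0
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= p * math.log(p) + (1.0 - p) * math.log(1.0 - p)
    return total

def kl_to_uniform_prior(probs, eps=1e-12):
    """KL( prod_i Bern(p_i) || prod_i Bern(0.5) ), accumulated bit by bit."""
    total = 0.0
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)
        total += p * math.log(p / 0.5) + (1.0 - p) * math.log((1.0 - p) / 0.5)
    return total

posterior = [0.9, 0.1, 0.5, 0.7]
kl = kl_to_uniform_prior(posterior)
# Same quantity written as "cross-entropy with the prior minus posterior entropy".
kl_from_entropies = len(posterior) * math.log(2.0) - bernoulli_entropy(posterior)
```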
The whole network of JMLH is trained using only Eq. (3). This makes the optimization extremely simple, requiring no auxiliary module or additional complex loss function. The only problem comes from the gradient computation of the intractable expected negative log-likelihood $\mathbb{E}_{q_\phi(b|x)}[-\log q_\theta(y|b)]$, which is discussed in Section 2.1.3.
2.1.3 On the Tractability of JMLH
Computing the gradients of the negative log-likelihood expectation term of Eq. (3) is intractable. One needs to traverse the latent space of $B$ for each sample to accurately obtain the loss and the corresponding gradients. Inspired by [10], we use the following reparametrization of $b$ (although the reparametrization trick [19] was initially designed for continuous variables, we keep using this terminology here, because the trick proposed in [10] leads to a gradient estimator similar to the one of [19]):
(5) 
where each $\epsilon_i$ is a small random signal. Eq. (5) is conventionally termed the stochastic binary neural activation. With this reparametrization, the gradient with respect to the encoder parameters $\phi$ can be estimated by the distributional derivative estimator [10]:
(6) 
With this estimator, the network of JMLH can be trained with SGD end-to-end. Note that can be deterministically obtained and does not require approximation since does not involve stochasticity.
The whole training process is illustrated in Algorithm 1, and the respective variable feed path is illustrated in Figure 2 (a). The gradient scaler used in Algorithm 1 is the Adam optimizer [18] in this work. It can be seen that, during training, JMLH behaves identically to a normal neural classifier. The only additional step is to sample the random signals.
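The sampling step of the stochastic binary activation can be sketched as follows. This is an illustration under assumptions: each random signal is taken to be uniform on [0, 1) and a bit fires when its probability is at least that signal, which is the usual form of a stochastic binary neuron but is not confirmed by the text above. Only the forward sampling is shown; the gradient estimator itself is not implemented here.

```python
import random

def stochastic_binarize(probs, rng):
    """Sample b_i = 1 if p_i >= eps_i else 0, with eps_i drawn uniformly from [0, 1)."""
    return [1 if p >= rng.random() else 0 for p in probs]

rng = random.Random(0)
probs = [0.1, 0.5, 0.9]  # per-bit probabilities produced by the encoder (toy values)
n_draws = 20000
counts = [0] * len(probs)
for _ in range(n_draws):
    for i, bit in enumerate(stochastic_binarize(probs, rng)):
        counts[i] += bit
# Empirical firing rates; each one should be close to the matching entry of probs.
freqs = [c / n_draws for c in counts]
```

Because each bit fires with probability equal to its encoder output, the sampled codes stay exactly binary while their expectation matches the real-valued activations.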
2.1.4 Out-of-Sample Extension
Given a query datum $x$, the corresponding hash code $b$ is produced by the encoder, i.e.,
(7) 
which is shown in Figure 2 (b).
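A deterministic test-time encoding can be sketched as below. The rule shown, thresholding each bit probability at 0.5, is an assumption for illustration and may differ from the exact form of Eq. (7); `encode_query` is a hypothetical helper name.

```python
def encode_query(bit_probs, threshold=0.5):
    """Turn per-bit probabilities into a fixed binary code without sampling."""
    return [1 if p >= threshold else 0 for p in bit_probs]

query_code = encode_query([0.2, 0.5, 0.93, 0.49])
```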
2.2 Theoretical Analysis
2.2.1 Exploring the Information Bottleneck
In this subsection, we show that JMLH defines a special discrete extension of VIB [2] to learn information-rich codes. By empirically assigning the joint probability of $X$ and $Y$ with the Dirac delta function, i.e., data samples are independent, the negative learning objective of JMLH can be rewritten as
(8) 
where the first RHS term is the variational lower bound of the mutual information $I(B;Y)$ and the second RHS term is the lower bound of the negative mutual information $-I(B;X)$ according to [2]. Consequently, $-\mathcal{L}$ literally lower-bounds the IB [36] objective:
(9) 
We refer to the related articles [2, 36] for more detailed definitions.
Intuitively, our learning objective allows $B$ to maximally represent the semantic meaning of the label space $Y$ by ascending $I(B;Y)$. Note that, though $I(B;X)$ acts as a penalty in Eq. (9), we are not expecting zero mutual information between $B$ and $X$, otherwise the produced codes would be data-independent. The purpose of introducing $I(B;X)$ is to filter out redundant information not related to the semantic meanings of data during encoding, and simultaneously preserve the essential part to support $I(B;Y)$. In this way, the learned codes can be compressed and discriminative.
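The bounding argument can be sketched with the standard VIB inequalities of [2]. The notation ($q_\phi$, $q_\theta$, $p$, $\lambda \ge 0$) is assumed for this illustration, and the constants and scaling may differ from Eqs. (8) and (9).

```latex
\begin{align}
I(B;Y) &\ge H(Y) + \mathbb{E}_{p(x,y)}\,\mathbb{E}_{q_\phi(b \mid x)}\big[\log q_\theta(y \mid b)\big], \\
I(B;X) &\le \mathbb{E}_{p(x)}\,\mathrm{KL}\big(q_\phi(b \mid x)\,\|\,p(b)\big), \\
I(B;Y) - \lambda\, I(B;X) &\ge
  \mathbb{E}_{p(x,y)}\Big[\mathbb{E}_{q_\phi(b \mid x)}\big[\log q_\theta(y \mid b)\big]
  - \lambda\,\mathrm{KL}\big(q_\phi(b \mid x)\,\|\,p(b)\big)\Big] + H(Y).
\end{align}
```

Since the label entropy $H(Y)$ is non-negative for discrete labels and does not depend on the model parameters, dropping it keeps the bound valid and leaves the expected negative objective on the right-hand side.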
2.2.2 Nearest Neighbour Search with Recognition
In the context of large-scale data retrieval, relevant data pairs are usually and conveniently defined by shared labels/tags, which is generally reasonable. It is tedious and inefficient to traverse all data points in a dataset and explicitly assign pairwise similarity marks to each of them, while the labels/tags can be regarded as similarity 'anchors' that ease this process.
JMLH favors this setting as it is literally a special classifier during training. The bottleneck latents are directly linked to the data labels. When the model is well-trained, the codes of relevant data are naturally located within short Hamming distances. This idea has also been verified in many label-based hashing approaches [17, 29].
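Retrieval over such codes reduces to ranking by Hamming distance. The sketch below is a minimal illustration with hypothetical helper names: it packs each code into an integer and counts differing bits with XOR, rather than reproducing any particular system's index.

```python
def pack_bits(bits):
    """Pack a list of 0/1 bits into a single integer (first bit is most significant)."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

def hamming_distance(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

def rank_by_hamming(query, database):
    """Database indices sorted by distance to the query; ties keep index order."""
    return sorted(range(len(database)), key=lambda i: hamming_distance(query, database[i]))

database = [pack_bits(c) for c in ([0, 0, 1, 1], [1, 1, 1, 1], [0, 0, 1, 0], [1, 0, 0, 0])]
query = pack_bits([0, 0, 1, 1])
ranking = rank_by_hamming(query, database)  # distances are 0, 2, 1, 3 -> [0, 2, 1, 3]
```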
3 Related Work
Our work is related to various hashing techniques, of which the most popular and related ones are selectively discussed according to our motivation and design.
3.1 Solving the Discrete Constraints
Traditional solutions. We first look at the problem of discrete optimization. A typical example is SDH [29], which also performs encoding and classification sequentially. However, as SDH [29] resorts to Discrete Cyclic Coordinate descent (DCC) for alternating code updating, a held-out optimization step is involved. Practically, this is hard to parallelize and to optimize batch-wise. Additionally, training errors of the classification step cannot be efficiently propagated back to the encoder. A similar paradigm can be found in [39], while its objective is based on pairwise data similarity. In both single-modal hashing [40, 11] and cross-modal hashing [23, 32], alternating code updating is widely adopted. For those methods that have held-out code learners, the network is regularized by the produced target codes. The disadvantage of this disjointed process is the low training quality. On the other hand, regularizing the network by quantization is also widely considered [6, 12, 17, 30]. However, these approaches ignore a severe discrepancy between the codes seen in training and those used at test time. The network observes continuous codes during training, which may carry different meanings from their discrete counterparts at test time. This problem is explicitly solved in JMLH as our code bottleneck is exactly binary.
Gradient estimation solutions. Some existing hashing models handle the discrete constraints for SGD with gradient estimation techniques so that the hashing model can be conveniently trained. In SGH [10], a distributional derivative estimator is proposed based on the Taylor expansion of the gradient, and the discreteness is kept by the stochastic neuron. This approach has a form similar to the reparametrization trick [19], and is unbiased and stable during training. It is also adopted in [31], and JMLH follows the same idea. An alternative simple choice here is the Straight-Through (ST) estimator [3], which is used in GreedyHash [35]. The REINFORCE algorithm [38] is also employed for the same purpose in [41], though it suffers from high variance during training.
3.2 Enriching the Semantic Information
JMLH is not the first model that trains the hashing network with classification objectives. For instance, SUBIC [17] also employs a classification loss as its learning objective. Specifically, SUBIC [17] separates the hash code into blocks and grounds each code block on a simplex in order to favor the discreteness. This approach considerably limits the maximal information carried by the codes. Besides, the supervised version of GreedyHash [35] is similar to JMLH both in terms of the classification objective and in keeping the discrete constraints. However, GreedyHash [35] only uses the quantization loss on the code bottleneck, ignoring the entropy of the codes, while we minimize the KL divergence between the code posterior and the prior to preserve the entropy. Moreover, GreedyHash [35] provides no theoretical clue as to how the trained codes are related to data semantics.
MIHash [4] borrows the concept of mutual information, as JMLH does, but ends up with a different design. Our model reflects the mutual information between codes and data semantics as a part of VIB [2], while MIHash [4] considers the relevant-irrelevant code distribution discrepancy and requires complex histogram binning [37] during training.
Recently, a popular idea in deep representation learning is to employ Generative Adversarial Networks (GANs) [16] during training, which has been attempted in [5, 13, 34, 45]. The discriminators or the encoders in GANs are aware of the data distribution without explicitly parameterizing it. The problem is that the auxiliary generator significantly increases the training complexity as more parameters are introduced.
We experimentally show that the above sophisticated designs are not always necessary, as the simple network of JMLH can already achieve state-of-the-art retrieval performance.
4 Experiments
Method  Supervision  CIFAR-10 (mAP@all)  NUS-WIDE (mAP@5000)  ImageNet (mAP@1000)
    16 bits  32 bits  64 bits  16 bits  32 bits  64 bits  16 bits  32 bits  64 bits
ITQ [14]  ✗  0.201  0.207  0.235  0.627  0.645  0.664  0.217  0.317  0.391 
AGH [26]  ✗  0.217  0.205  0.182  0.592  0.615  0.616  0.241  0.327  0.379 
DGH [24]  ✗  0.199  0.200  0.212  0.572  0.607  0.627  0.270  0.341  0.373 
KSH [25]  ✓  0.451  0.473  0.507  0.448  0.520  0.566  0.216  0.257  0.394 
ITQCCA [15]  ✓  0.463  0.498  0.505  0.555  0.512  0.460  0.235  0.377  0.576 
SDH [29]  ✓  0.499  0.525  0.546  0.595  0.595  0.617  0.298  0.431  0.504 
CNNH [39]  ✓  0.453  0.509  0.537  0.570  0.583  0.600  0.281  0.450  0.554 
DNNH [22]  ✓  0.556  0.558  0.599  0.598  0.616  0.639  0.290  0.461  0.565 
DHN [43]  ✓  0.564  0.603  0.626  0.637  0.664  0.671  0.311  0.472  0.573 
HashNet [8]  ✓  0.643  0.675  0.687  0.662  0.699  0.716  0.506  0.631  0.684 
MIHash [4]  ✓  0.760  0.776  0.761  0.722  0.759  0.779  0.569  0.661  0.694
HashGAN [5]  ✓  0.668  0.731  0.749  0.715  0.737  0.748  -  -  -
PGDH [41]  ✓  0.741  0.747  0.762  0.780  0.786  0.792  0.653  0.707  0.716
GreedyHash [35]  ✓  0.786  0.810  0.833  -  -  -  0.625  0.662  0.688
JMLH (Ours)  ✓  0.805  0.841  0.837  0.795  0.818  0.820  0.668  0.714  0.727 
Extensive image retrieval experiments are conducted in this section, mainly according to the following themes:

Comparison with existing methods. We show that, simple as JMLH is, it still outperforms state-of-the-art hashing models.

Ablation study. The importance of each part of JMLH is evaluated and discussed.

Intuitive results. Some illustrative results are provided to implicitly justify the effectiveness of JMLH.
4.1 Experimental Settings
4.1.1 Implementation Details
JMLH is implemented with the popular deep learning toolbox TensorFlow [1]. The network specifics are provided in Table 1. For our image retrieval task, AlexNet [21] before the fc_7 layer is adopted as the network backbone, where the parameters are initialized with the ImageNet [28] pretrained weights and are jointly updated during training. For multi-label datasets, e.g., NUS-WIDE [9], we activate the last layer of the classifier with the sigmoid nonlinearity, while the softmax activation is used when training JMLH on CIFAR-10 [20] and ImageNet [28]. JMLH involves one hyperparameter, i.e., the regularization factor $\lambda$. We empirically set . The learning rate of the Adam optimizer [18] is set to . We fix the training batch size to 256. The code can be found at https://github.com/ymcidence/JMLH.
4.1.2 Datasets
CIFAR-10 [20] consists of 60,000 images from 10 classes. We follow the common setting [13, 22, 35] and select 1,000 images (100 per class) as the query set. The remaining 59,000 images are regarded as the database. The training set contains 5,000 images, uniformly selected from the database.
NUS-WIDE [9] is a collection of nearly 270,000 Web images of 81 categories downloaded from Flickr. Following the settings in [26, 39, 22], we adopt the subset of images from the 21 most frequent categories. 100 images of each class are utilized as a query set and the remaining images form the database. For training, we employ 10,500 images uniformly selected from the 21 classes.
ImageNet [28] was originally released for large-scale image classification, and has recently been used in deep hashing evaluation. Following [8, 41], we randomly select 100 categories to perform our retrieval task. All the original training images are used as the database, and all the validation images form the query set. For each category, 130 images are used for training.
4.2 Comparison with Existing Methods
We compare JMLH with existing methods using conventional evaluation metrics, including top-$k$ mean Average Precision (mAP@$k$), Precision of the top-$k$ retrieved samples (Precision@$k$), Precision within Hamming radius of 2 (P@H2) and Precision-Recall (PR) curves.
Note that, for mAP@$k$, we adopt the most popular settings of $k$ = all, 5000, and 1000 for CIFAR-10, NUS-WIDE, and ImageNet, respectively, according to [13, 35, 41].
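The ranking metrics can be sketched as follows. Conventions for average precision at a cutoff vary across papers; this example normalizes by the number of relevant items found within the cutoff, which is an assumption and not necessarily the convention used for the reported numbers. The function names are hypothetical.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved items that are relevant (relevance holds 0/1 flags)."""
    return sum(relevance[:k]) / k

def average_precision_at_k(relevance, k):
    """Mean of the precision values taken at each relevant position within the top k."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision_at_k(relevance_lists, k):
    """Average the per-query AP@k over all queries."""
    return sum(average_precision_at_k(r, k) for r in relevance_lists) / len(relevance_lists)

# Two toy queries: 1 marks a retrieved item that shares a label with the query.
ranked = [[1, 0, 1, 0], [0, 0, 0, 0]]
ap_first = average_precision_at_k(ranked[0], 4)   # (1/1 + 2/3) / 2
map_at_4 = mean_average_precision_at_k(ranked, 4)
```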
4.2.1 Baselines
JMLH is compared with various widely recognized hashing baselines, including ITQ [14], AGH [26], DGH [24], KSH [25], ITQ-CCA [15], SDH [29], CNNH [39], DNNH [22], DHN [43], HashNet [8], HashGAN [5], PGDH [41] and the supervised version of GreedyHash [35]. Note that the name HashGAN is used both in [13] and [5]. Here we refer to the latter as it is a supervised approach and thus more related to our work.
For feature-based models, i.e., shallow hashing models, we use the AlexNet [21] fc_7 pretrained features to represent data for training and test. As for the end-to-end baseline frameworks, we directly adopt the training settings described in their original papers, and pretrained weights are also applied for fine-tuning when possible.
4.2.2 Results and Analysis
The retrieval mAP@$k$ results are reported in Table 2. The respective PR curves, Precision@$k$ and P@H2 scores are illustrated in Figure 3.
It can be observed that JMLH consistently outperforms the compared baselines, though many of them contain more trainable parameters, e.g., HashGAN [5]. This result aligns with our motivation, and suggests that, with the current evaluation metrics, one may not require an extremely complex model to obtain the best-performing deep hashing function.
The performance margin between JMLH and GreedyHash [35] is not significant on CIFAR-10 [20], but this gap gets larger when it comes to a relatively more difficult situation, e.g., ImageNet [28]. This raises the question of what a proper regularization term for training is. Both GreedyHash [35] and JMLH are trained with classification-oriented objectives. The former involves a quantization penalty while JMLH considers equally distributed bits to maximize the expected code entropy. This factor becomes essential when the data label space is large and the training samples are limited, as the codes need to be expressive enough to be successfully classified. We find our design has better generalization ability in this case.
4.3 Ablation Study
In this subsection, we evaluate different components in terms of formulating a simple deep hashing model, and empirically show which one is of importance for good performance.
4.3.1 Baselines
JMLH-Cont. We first look at the influence of quantization. By dropping the binary stochastic neuron and employing the sigmoid activation on the code bottleneck, a regular deep neural classifier is built. The regularization term is kept, and is subsequently analyzed by other baselines.
JMLH-QR. The KL term of Eq. (3) is replaced by the quantization regularizer between the activated binary codes and their real-valued counterparts before the stochastic neurons.
JMLH-NR. The regularizer is removed in this baseline, and the whole learning objective is formulated by the classification cross-entropy.
JMLH-VAE. We replace the classifier with a decoder, and use the reconstruction error instead of the classification loss during training. Therefore, the model collapses to an unsupervised Variational Auto-Encoder (VAE) [19], with a negative Evidence Lower BOund (ELBO) of
(10) 
For simplicity of training, the encoder and decoder for this baseline are both implemented as two-layer neural networks and are fed with AlexNet [21] fc_7 features.
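For reference, the standard negative ELBO of a VAE [19] with a binary latent code can be written as below. The decoder symbol $p_\theta(x \mid b)$ and the remaining notation are assumed for this illustration, and whether the KL term of this baseline carries an extra weight is not stated in the text.

```latex
-\mathrm{ELBO}(x) \;=\;
  \mathbb{E}_{q_\phi(b \mid x)}\big[-\log p_\theta(x \mid b)\big]
  \;+\; \mathrm{KL}\big(q_\phi(b \mid x)\,\|\,p(b)\big)
```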
4.3.2 Results and Analysis
#  Baseline  16 bits  32 bits  64 bits
1  JMLH-Cont  0.616  0.628  0.659
2  JMLH-QR  0.778  0.827  0.835
3  JMLH-NR  0.729  0.725  0.736
4  JMLH-VAE  0.423  0.435  0.441
5  JMLH (full model)  0.805  0.841  0.837
The mAP results of the above-mentioned baselines are shown in Table 3. Since JMLH-VAE is an unsupervised model, its performance is relatively lower than the others. We observe a drop of roughly 0.18 to 0.21 in mAP when using the continuous relaxation during training, i.e., JMLH-Cont. As discussed in Section 3, the binary constraints are essential for models like JMLH as they directly influence the classifier's observation. Without regularization, JMLH-NR struggles with training-test generalization. Though not competitive with our full model, JMLH-QR still performs close to GreedyHash [35], as the learning objectives are similar. The difference between JMLH-QR and GreedyHash [35] lies in the stochasticity of gradient estimation. Both ST [3] and the distributional derivative [10] work in this case as long as the binary constraints are not violated. Hence, a proper learning objective becomes more important.
4.4 More Results
4.4.1 HyperParameter
The regularization penalty of JMLH is scaled by the hyperparameter $\lambda$. By default, it is set to for the overall best performance. The impact of $\lambda$ is illustrated in Figure 4 (a). The performance drops quickly when $\lambda$ grows, which reflects the penalty on the mutual information between data and codes, i.e., $I(B;X)$. A large value of $\lambda$ over-regularizes the model by decorrelating $B$ from $X$, making the produced codes less informative.
4.4.2 Towards Model Simplicity
One key claim of this paper is to build a simple deep hashing model. Training JMLH is straightforward and efficient. Our classification likelihood learning objective provides a direct way to convey data semantics to the encoder. We show a training efficiency comparison between JMLH and MIHash [4] in Figure 4 (b). It can be observed that JMLH converges to its best performance more quickly than MIHash [4], with a margin of 10 epochs. Although MIHash [4] requires no auxiliary networks, its histogram-based learning objective introduces complex positive-negative data pairing and histogram binning. All these factors make the training of MIHash [4] indirect, resulting in a relatively slower convergence rate than JMLH. Note that the performance of MIHash is slightly lower than the one reported in [4], as it was previously trained with VGG [33] features and we reproduce the results with the AlexNet [21] backbone for fair comparison.
The whole parameter size of JMLH for all experiments conducted in this section is slightly smaller than that of AlexNet [21], as we have a relatively narrow fully-connected bottleneck in the middle. Compared with the models that involve end-to-end generative networks [13, 5], this is a lightweight design.
4.4.3 Extremely Short Codes
Following [35], we also explore the minimal size of codes to represent data semantics. The experiments are conducted by setting the code length to , and the corresponding results are shown in Figure 4 (c). We can see that, compared with GreedyHash [35] and DHN [43], JMLH obtains better performance even when the encoding length is very short. The entropy-preserving regularization term plays the key role here since the maximum number of concepts that the code space can cover is limited.
4.4.4 Visualization Results
The t-SNE [27] visualization of 32-bit JMLH on CIFAR-10 [20] is shown in Figure 5 (a). Even though the proposed model is simple both in terms of network structure and learning objective, the resulting binary codes are still clearly scattered in the feature space according to their semantic meanings. We further provide several image retrieval examples where the top-10 retrieved candidates are shown together with the query image in Figure 5 (b). JMLH successfully finds related images at the top of the retrieval list. Here we only show the 32-bit results to keep the content concise.
5 Conclusion
In this paper, we proposed a simple but effective deep hashing model called JMLH. Our model shaped a conventional deep neural network with a single likelihood maximization learning objective. A differentiable binary bottleneck was plugged in, making the whole network end-to-end trainable using SGD. JMLH was linked to the information bottleneck methods, which aimed at learning maximally representative features for a given task. We showed that, when applying proper binary-preserving gradient estimators and suitable regularization terms, a single classification model could generate high-quality hash codes for similarity search, outperforming state-of-the-art models.
References
[1] (2016) TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
[2] (2016) Deep variational information bottleneck. In International Conference on Learning Representations (ICLR).
[3] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
[4] (2017) MIHash: online hashing with mutual information. In IEEE International Conference on Computer Vision (ICCV).
[5] (2018) HashGAN: deep learning to hash with pair conditional Wasserstein GAN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[6] (2018) Deep Cauchy hashing for Hamming space retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] (2017) HashNet: deep learning to hash by continuation. In The IEEE International Conference on Computer Vision (ICCV).
[8] (2017) HashNet: deep learning to hash by continuation. In IEEE International Conference on Computer Vision (ICCV).
[9] (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In ACM International Conference on Image and Video Retrieval (CIVR).
[10] (2017) Stochastic generative hashing. In International Conference on Machine Learning (ICML).
[11] (2016) Learning to hash with binary deep neural network. In European Conference on Computer Vision (ECCV).
[12] (2015) Deep hashing for compact binary codes learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[13] (2018) Unsupervised deep generative adversarial hashing network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[14] (2013) Iterative quantization: a Procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12), pp. 2916–2929.
[15] (2013) Iterative quantization: a Procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12), pp. 2916–2929.
[16] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS).
[17] (2017) SUBIC: a supervised, structured binary code for image search. In IEEE International Conference on Computer Vision (ICCV).
[18] (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
[19] (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR).
[20] (2009) Learning multiple layers of features from tiny images.
[21] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[22] (2015) Simultaneous feature learning and hash coding with deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[23] (2017) Deep sketch hashing: fast free-hand sketch-based image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[24] (2014) Discrete graph hashing. In Advances in Neural Information Processing Systems (NIPS).
[25] (2012) Supervised hashing with kernels. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[26] (2011) Hashing with graphs. In International Conference on Machine Learning (ICML).
[27] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
[28] (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
[29] (2015) Supervised discrete hashing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[30] Unsupervised binary representation learning with deep variational networks. International Journal of Computer Vision, DOI: 10.1007/s11263-019-01166-4.
[31] (2018) Zero-shot sketch-image hashing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[32] (2017) Deep binaries: encoding semantic-rich cues for efficient textual-visual cross retrieval. In IEEE International Conference on Computer Vision (ICCV).
[33] (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
[34] (2018) Binary generative adversarial networks for image retrieval. In AAAI Conference on Artificial Intelligence (AAAI).
[35] (2018) Greedy hash: towards fast optimization for accurate hash coding in CNN. In Advances in Neural Information Processing Systems.
[36] (1999) The information bottleneck method. In Annual Allerton Conference on Communication, Control, and Computing.
[37] (2016) Learning deep embeddings with histogram loss. In Advances in Neural Information Processing Systems (NIPS).
[38] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256.
[39] (2014) Supervised hashing for image retrieval via image representation learning. In AAAI Conference on Artificial Intelligence (AAAI).
[40] (2016) Zero-shot hashing via transferring supervised knowledge. In ACM International Conference on Multimedia (MM).
[41] (2018) Relaxation-free deep hashing via policy gradient. In The European Conference on Computer Vision (ECCV).
[42] (2018) Graph convolutional network hashing. IEEE Transactions on Cybernetics, pp. 1–13. ISSN 2168-2267.
[43] (2016) Deep hashing network for efficient similarity retrieval. In AAAI Conference on Artificial Intelligence (AAAI).
[44] (2016) Fast training of triplet-based deep binary embedding networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[45] (2018) BinGAN: learning compact binary descriptors with a regularized GAN. In Advances in Neural Information Processing Systems (NIPS).