1 Introduction
The vast amount of data available nowadays, combined with the need to provide quick answers to users' queries, has led to the development of several hashing techniques [49, 17, 12, 45, 42, 48, 7, 23, 26]. Hashing represents objects using compact codes that allow for fast and efficient queries in large object databases. Early hashing methods, e.g., Locality Sensitive Hashing (LSH) [5], focused on extracting generic codes that could, in principle, describe every possible object and information need. However, it was later established that supervised hashing, which learns hash codes tailored to the task at hand, can significantly improve retrieval precision. In this way, it is possible to learn even smaller hash codes, since the extracted code must only encode the information needs in which the users are actually interested. Note, however, that the extracted hash codes must also encode part of the semantic relationships between the encoded objects, in order to provide a meaningful ranking of the retrieved results.
Many supervised and semi-supervised hashing methods have been proposed [15, 40, 46, 47, 50, 44, 37]. However, these methods were, to a great extent, developed heuristically, without a solid theory regarding the actual retrieval process. For example, many methods employ the pairwise distances between the images [15, 40, 50], or are based on sampling triplets that must satisfy specific relationships according to the given ground truth [46, 47], without a proper theoretical motivation for these choices. On the other hand, information-theoretic measures, such as entropy and mutual information [32], have been proven to provide robust solutions to many machine learning problems, e.g., classification [32]. However, very few steps toward using these measures for supervised hashing tasks have been made so far.

In this paper, we provide a connection between an information-theoretic measure, the Mutual Information (MI) [32], and the process of information retrieval. More specifically, we argue that mutual information can naturally model the process of information retrieval, providing a solid framework for developing retrieval-oriented supervised hashing techniques. Even though MI provides a sound theoretical formulation for the problem of information retrieval, applying it in real scenarios is usually intractable, since there is no efficient way to calculate the actual probability densities involved in the calculation of MI. The great amount of data, as well as their high dimensionality, further complicates the practical application of such measures.
The main contribution of this paper is an efficient deep supervised hashing algorithm that optimizes the learned codes using a novel extension of an information-theoretic measure, the Quadratic Mutual Information (QMI) [36]. The architecture of the proposed method is shown in Fig. 1.
To derive a practical algorithm that can efficiently scale to large datasets:

- We adapt QMI to the needs of supervised hashing by employing a similarity measure that is closer to the actual distance used for the retrieval process, i.e., the Hamming distance. This gives rise to the proposed Quadratic Spherical Mutual Information (QSMI). It is also experimentally demonstrated that the proposed QSMI is more robust than the classical Gaussian-based Kernel Density Estimation used in QMI [36], while it does not require careful tuning of any hyperparameters.

- We propose a smoother optimization objective employing a novel square clamping approach. This significantly improves the stability of the optimization, while reducing the risk of converging to bad local minima.

- We adapt the proposed approach to a batch-based setting by employing a method that dynamically estimates the prior probabilities, as they are observed within each batch. In this way, the proposed method can efficiently scale to larger datasets.

- We demonstrate that the proposed method can be readily extended to efficiently handle different scenarios, e.g., retrieval of unseen classes [34].
The proposed method is extensively evaluated using three image datasets, including the two standard datasets used for evaluating supervised hashing methods, CIFAR-10 [14] and NUS-WIDE [4], and it is demonstrated that it outperforms the existing state-of-the-art techniques. Following the suggestions of [34], we also evaluate the proposed method in a different evaluation setup, where the learned hash codes are evaluated using unseen information needs. A PyTorch-based implementation of the proposed method is available at https://github.com/passalis/qsmi (the code will be available shortly after the review process), allowing any researcher to easily use the proposed method and readily reproduce the experimental results.

2 Related Work
The increasing interest in learning compact hash codes, together with the great learning capacity of recent deep learning models, led to the development of several deep supervised hashing techniques. Deep supervised hashing techniques involve: a) a deep neural network, which is used to extract a representation from the data, b) a supervised loss function, which is used to train the network, and c) a hashing mechanism, e.g., an appropriate non-linearity [18] or regularizer [50], which ensures that the output of the network can be readily transformed into a hash code. Most of the proposed methods fall into one of the following two categories, according to the loss function employed for learning the supervised codes: a) pairwise-based hashing methods [15, 40, 50, 18, 21, 10, 19, 35] and b) triplet-based hashing methods [46, 47, 39].

Pairwise-based methods work by learning hash codes that minimize / maximize the pairwise distance / log-likelihood between similar / dissimilar pairs, e.g., Convolutional Neural Network (CNN)-based hashing [40], network-in-network hashing [15], deep hashing network [50], deep pairwise-supervised hashing [19], and deep supervised discrete hashing [18]. More advanced pairwise methods employ margins that allow for learning more regularized representations, e.g., deep supervised hashing [21], use asymmetric hashing schemes, e.g., deep asymmetric pairwise hashing [35] and asymmetric deep supervised hashing [10], or use more advanced techniques to obtain the binary codes, e.g., hashing by continuation [3]. Even though these techniques have been applied with great success, they were largely developed heuristically. On the other hand, the proposed method works by maximizing the mutual information between the learned hash codes and the ground truth.
Triplet-based methods work by sampling an anchor point along with a positive and a negative example [46, 47, 39, 6]. They then learn codes that increase the similarity between the anchor and the positive example, while reducing the similarity between the anchor and the negative example. However, triplet-based methods are significantly more computationally expensive than pairwise-based methods, requiring a huge number of triplets to be generated (many of which convey no information, since they are already satisfied by the code learned by the network), which limits their practical application. Also note that many non-deep supervised hashing methods have been proposed as well, e.g., [11, 20, 22], but an extensive review of them is out of the scope of this paper. The interested reader is referred to [38] for an extensive literature review on hashing.
The use of MI has also been investigated as an aid to various aspects of the retrieval process. In [1, 8] MI is employed to provide relevance feedback, while in [2] MI is used to provide updates for online hashing. More specifically, Shannon's definition of MI is used in [2], leading to a Monte Carlo sampling scheme for approximating the MI, together with a differentiable histogram binning technique. Our approach is vastly different, since instead of approximating the MI through random sampling, we analytically derive computationally tractable solutions for calculating MI through a QMI formulation. Furthermore, we also adapt MI to the actual needs of hashing by employing a spherical formulation that is closer to the Hamming distance. Note that other information-theoretic criteria, such as entropy [28, 29, 30], have also been employed to optimize various representations towards information retrieval.
To the best of our knowledge, this is the first work that employs a quadratic spherical mutual information loss fully adapted to the needs of deep supervised hashing. Apart from deriving a practical algorithm and demonstrating its ability to outperform existing state-of-the-art methods, the proposed method provides a complete framework that can be used to model the process of information retrieval. This formulation is fully differentiable, allowing for the end-to-end optimization of deep neural networks for any retrieval-related task, ranging from learning retrieval-oriented representations and compact hash codes to fine-tuning the extracted representations using relevance feedback.
3 Proposed Method
The proposed method is presented in detail in this Section. First, the links between mutual information and information retrieval are described. Then, the quadratic mutual information is introduced, the proposed quadratic spherical mutual information is derived, and several aspects of the proposed method are discussed.
3.1 Information Retrieval and Mutual Information
Let $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$ be a collection of $N$ images, where $\mathbf{x}_i \in \mathbb{R}^d$ is the representation of the $i$-th image extracted using an appropriate feature extractor, e.g., a deep neural network. Each image fulfills a set of information needs. For example, an image that depicts a "red car near a beach" fulfills at least the following information needs: "car", "red car", "beach", "car near beach". Note that the information needs that an image actually fulfills depend on both its content and the needs of the users, since, depending on the actual application, the interests of the users are usually focused on a specific area. For example, an image of a man entering a bank represents different information needs for a forensics database used by the police to identify suspects and for a generic web search engine. The problem of information retrieval can then be defined as follows: given an information need, retrieve the images of the collection that fulfill this information need and rank them according to their relevance to the given information need. This work focuses on content-based information retrieval [16], where the information need is expressed through a query image $\mathbf{q}$, which is usually not part of the collection $\mathcal{X}$.
To measure how well an information retrieval system works, a ground truth set that contains a set of information needs and the corresponding images that fulfill them is usually employed. Let $N_C$ be the number of information needs. Then, for each information need $c_i$, a set of images $\mathcal{X}_i = \{\mathbf{x}_1^{(i)}, \dots, \mathbf{x}_{N_i}^{(i)}\}$ is given, where $\mathbf{x}_j^{(i)}$ is the representation of the $j$-th image that fulfills the $i$-th information need. Note that $\sum_{i=1}^{N_C} N_i = N$ when the information needs are mutually exclusive. Since all these images fulfill the same information need, they can all be used as queries to express this information need. However, there are also other images, which are usually not known beforehand, that also express the same information need and can also be used to query the database. The distribution of the images that fulfill the $i$-th information need can be modeled using the conditional probability density function $p(\mathbf{x} | c_i)$.
Let $X$ be a random vector that represents the images and $C$ be a random variable that represents the information needs. The Shannon entropy of the information needs, which expresses the uncertainty regarding the information need that a randomly sampled image fulfills, is defined as [32]:

$$H(C) = -\sum_{i=1}^{N_C} P(c_i) \log P(c_i), \quad (1)$$
where $P(c_i)$ is the prior probability of the information need $c_i$, i.e., the probability that a random image of the collection fulfills the information need $c_i$. Note that the above definition implicitly assumes that the information needs are mutually exclusive, i.e., $P(c_i, c_j) = 0$ for $i \neq j$, or equivalently, that each image satisfies only one information need. This is without loss of generality, since it is straightforward to extend this definition to the general case, where each image can satisfy multiple information needs, simply by measuring the (binary) entropy of each information need separately.
To simplify the presentation of the proposed method, we assume that the information needs are mutually exclusive. Nonetheless, the proposed approach can still be used with minimal modifications, as we also experimentally demonstrate in Section 4, even when this assumption does not hold.
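As a concrete illustration of the entropy in (1), the following sketch computes $H(C)$ for two hypothetical prior distributions over information needs (the function name and the example priors are ours, not from the paper): uncertainty is maximal for equiprobable needs and drops when one need dominates.

```python
import math

def information_needs_entropy(priors):
    """Shannon entropy H(C) = -sum_i P(c_i) log P(c_i) of the information needs."""
    return -sum(p * math.log(p) for p in priors if p > 0)

# Four equiprobable information needs: maximum uncertainty, H = log(4).
h_uniform = information_needs_entropy([0.25] * 4)

# One dominant information need: much lower uncertainty.
h_skewed = information_needs_entropy([0.97, 0.01, 0.01, 0.01])
```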
When the query vector $\mathbf{x}$ is known, the uncertainty of the information need that it fulfills can be expressed by the conditional entropy:

$$H(C | X) = -\int p(\mathbf{x}) \sum_{i=1}^{N_C} P(c_i | \mathbf{x}) \log P(c_i | \mathbf{x}) \, d\mathbf{x}. \quad (2)$$
Mutual information is defined as the amount by which the uncertainty about the information needs is reduced after observing the query vector:

$$I(X, C) = H(C) - H(C | X) = \sum_{i=1}^{N_C} \int p(\mathbf{x}, c_i) \log \frac{p(\mathbf{x}, c_i)}{p(\mathbf{x}) P(c_i)} \, d\mathbf{x}. \quad (3)$$
It is easy to see that MI can be interpreted as the Kullback-Leibler divergence between the joint density $p(\mathbf{x}, c)$ and the product of the marginals $p(\mathbf{x}) P(c)$. It is desired to maximize the MI between the representation of the images $X$ and the information needs $C$, since this ensures that the uncertainty regarding the information need that a query image expresses is minimized. Also, note that MI models the intrinsic uncertainty regarding the query vectors, since it employs the conditional probability density between the information needs and the images, instead of just a limited collection of images.

On the other hand, it is usually intractable to directly calculate the required probability densities and the corresponding integral in (3), limiting the practical applications of MI. However, as demonstrated later, it is possible to efficiently estimate the aforementioned probability densities and derive a practical algorithm that maximizes the MI between a representation and a set of information needs.
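Before turning to the quadratic formulation, the discrete analogue of (3) can be sketched in a few lines (the toy joint distributions below are ours): MI is zero when the representation carries no information about the needs, and equals $H(C)$ when the representation determines the need.

```python
import numpy as np

def mutual_information(joint):
    """I(X, C) = sum_{x,c} p(x,c) log( p(x,c) / (p(x) p(c)) ) for a discrete joint."""
    px = joint.sum(axis=1, keepdims=True)   # marginal over the representation
    pc = joint.sum(axis=0, keepdims=True)   # marginal over the information needs
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (px @ pc)[mask])).sum())

# Independent variables: observing x tells us nothing about c, so I(X, C) = 0.
indep = np.outer([0.5, 0.5], [0.5, 0.5])

# Perfectly aligned variables: observing x removes all uncertainty, I(X, C) = log(2).
aligned = np.array([[0.5, 0.0], [0.0, 0.5]])
```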
3.2 Quadratic Mutual Information
When the aim is not to calculate the exact value of MI, but to optimize a distribution that maximizes the MI, a quadratic divergence metric can be used instead of the Kullback-Leibler divergence. In this way, the Quadratic Mutual Information (QMI) is defined as [36]:

$$I_Q(X, C) = \sum_{i=1}^{N_C} \int \left( p(\mathbf{x}, c_i) - P(c_i) p(\mathbf{x}) \right)^2 d\mathbf{x}. \quad (4)$$
By expanding (4), QMI can be expressed as the sum of three information potentials, $I_Q(X, C) = V_{IN} + V_{ALL} - 2 V_{BTW}$, where:

$$V_{IN} = \sum_{i=1}^{N_C} \int p(\mathbf{x}, c_i)^2 \, d\mathbf{x}, \quad V_{ALL} = \sum_{i=1}^{N_C} P(c_i)^2 \int p(\mathbf{x})^2 \, d\mathbf{x}, \quad V_{BTW} = \sum_{i=1}^{N_C} P(c_i) \int p(\mathbf{x}, c_i) \, p(\mathbf{x}) \, d\mathbf{x}.$$
To calculate these quantities, the prior probability $P(c_i)$ and the densities $p(\mathbf{x} | c_i)$ and $p(\mathbf{x})$ must be estimated. The prior probabilities depend only on the distribution of the information needs in the collection of images. Therefore, for the $i$-th information need: $P(c_i) = N_i / N$, where $N_i$ is the number of images that fulfill the $i$-th information need. The conditional density of the images that fulfill the $i$-th information need can be estimated using the Parzen window estimation method [27]:

$$p(\mathbf{x} | c_i) = \frac{1}{N_i} \sum_{j=1}^{N_i} G(\mathbf{x} - \mathbf{x}_j^{(i)}, \sigma), \quad (5)$$
where $G(\mathbf{x}, \sigma)$ is a Gaussian kernel (in a $d$-dimensional space) with width $\sigma$ defined as:

$$G(\mathbf{x}, \sigma) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left( -\frac{\|\mathbf{x}\|_2^2}{2\sigma^2} \right). \quad (6)$$
Then, the joint probability density can be estimated as:

$$p(\mathbf{x}, c_i) = P(c_i) \, p(\mathbf{x} | c_i) = \frac{1}{N} \sum_{j=1}^{N_i} G(\mathbf{x} - \mathbf{x}_j^{(i)}, \sigma), \quad (7)$$
while the density of all the images as:

$$p(\mathbf{x}) = \frac{1}{N} \sum_{j=1}^{N} G(\mathbf{x} - \mathbf{x}_j, \sigma). \quad (8)$$
By substituting these estimates into the definitions of the information potentials, the following quantities are obtained:

$$V_{IN} = \frac{1}{N^2} \sum_{i=1}^{N_C} \sum_{k=1}^{N_i} \sum_{l=1}^{N_i} G(\mathbf{x}_k^{(i)} - \mathbf{x}_l^{(i)}, \sqrt{2}\sigma), \quad (9)$$

$$V_{ALL} = \frac{1}{N^2} \left( \sum_{i=1}^{N_C} P(c_i)^2 \right) \sum_{k=1}^{N} \sum_{l=1}^{N} G(\mathbf{x}_k - \mathbf{x}_l, \sqrt{2}\sigma), \quad (10)$$

and

$$V_{BTW} = \frac{1}{N^2} \sum_{i=1}^{N_C} P(c_i) \sum_{k=1}^{N_i} \sum_{l=1}^{N} G(\mathbf{x}_k^{(i)} - \mathbf{x}_l, \sqrt{2}\sigma), \quad (11)$$
where the following property regarding the convolution of two Gaussian kernels was used: $\int G(\mathbf{x} - \mathbf{x}_k, \sigma) \, G(\mathbf{x} - \mathbf{x}_l, \sigma) \, d\mathbf{x} = G(\mathbf{x}_k - \mathbf{x}_l, \sqrt{2}\sigma)$. The information potential $V_{IN}$ expresses the interactions between the images that fulfill the same information need, the information potential $V_{ALL}$ the interactions between all the images of the collection, while the potential $V_{BTW}$ models the interactions of the images that fulfill a specific information need against all the other images. Therefore, the QMI formulation allows for the efficient calculation of MI, since the MI is expressed as a weighted sum over the pairwise interactions of the images of the collection.
Using Parzen window estimation with a Gaussian kernel for estimating the probability density leads to the implicit assumption that the similarity between two images is expressed through their Euclidean distance. Thus, the images that fulfill an information need expressed by a query vector can be retrieved simply using nearest-neighbor search.
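The pairwise form of the potentials in (9)-(11) can be sketched directly with nested kernel sums (function names and the toy two-cluster data are ours; this is an illustration of the estimators, not the paper's implementation). Note that for balanced classes $V_{ALL} = V_{BTW}$, and the resulting QMI estimate is non-negative, since it is an exact integral of the squared difference of the Parzen estimates.

```python
import numpy as np

def gaussian_kernel(diff, sigma):
    """Isotropic Gaussian kernel G(x, sigma) evaluated on difference vectors."""
    d = diff.shape[-1]
    norm = (2 * np.pi * sigma ** 2) ** (d / 2)
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2 * sigma ** 2)) / norm

def qmi_potentials(x, labels, sigma=1.0):
    """Pairwise-sum estimates of V_IN, V_ALL, V_BTW; the kernel width sqrt(2)*sigma
    comes from the Gaussian convolution identity used in the derivation."""
    n = len(x)
    k = gaussian_kernel(x[:, None, :] - x[None, :, :], np.sqrt(2) * sigma)
    classes = np.unique(labels)
    priors = np.array([np.mean(labels == c) for c in classes])
    same = labels[:, None] == labels[None, :]
    v_in = k[same].sum() / n ** 2
    v_all = (priors ** 2).sum() * k.sum() / n ** 2
    v_btw = sum(p * k[labels == c, :].sum() for p, c in zip(priors, classes)) / n ** 2
    return v_in, v_all, v_btw

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
v_in, v_all, v_btw = qmi_potentials(x, labels)
qmi = v_in + v_all - 2 * v_btw
```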
3.3 Quadratic Spherical Mutual Information Optimization
Even though QMI allows for more efficient optimization of distributions, it suffers from several limitations: a) QMI involves the calculation of the pairwise similarity matrix between all the images of a collection, which quickly becomes intractable as the size of the collection increases. b) Selecting the appropriate width for the Gaussian kernels is not always straightforward, as a non-optimal choice can distort the feature space and slow down the optimization. c) The discrepancy between the distance metric used for QMI (Euclidean distance) and the distance used for the actual retrieval of the hashed images (Hamming distance) can negatively affect the retrieval accuracy. Finally, d) it was experimentally observed that directly optimizing the QMI is prone to bad local minima, due to the linear behavior of the loss function, which fails to distinguish between the pairs of images that cause high error and those that have a smaller overall effect on the learned representation (more details are given later in this Section).
To overcome limitations (b) and (c), we propose the Quadratic Spherical Mutual Information (QSMI). The proposed QSMI method replaces the Gaussian kernel in (6), used for calculating the similarity between two images in the information potentials in (9), (10), and (11), with the (shifted) cosine similarity:

$$S(\mathbf{x}_i, \mathbf{x}_j) = \frac{1}{2} \left( \frac{\mathbf{x}_i^T \mathbf{x}_j}{\|\mathbf{x}_i\|_2 \, \|\mathbf{x}_j\|_2} + 1 \right), \quad (12)$$
where $\|\cdot\|_2$ is the $l_2$ norm of a vector. In this way, we maintain the computationally efficient QMI formulation and avoid the need for manually tuning the width parameter of the Gaussian kernel, while adopting a formulation that is closer to the Hamming distance actually used for the retrieval process. Indeed, the cosine similarity can be interpreted as a normalized Hamming-based similarity measure [22]. To understand this, consider that the Hamming "similarity" (number of bits that are equal between two binary vectors) can be calculated as $(\mathbf{x}^T \mathbf{y} + n)/2$ for two binary $n$-dimensional vectors $\mathbf{x}, \mathbf{y} \in \{-1, 1\}^n$. Therefore, when binary vectors are used, the cosine similarity can be interpreted as a bounded, normalized version of the Hamming similarity.
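The equivalence between cosine similarity and Hamming similarity for $\{-1, +1\}$ codes can be checked numerically (helper names and the example codes are ours): for binary codes of length $n$, $\cos(\mathbf{a}, \mathbf{b}) = 2 \cdot \text{hamming\_similarity} / n - 1$.

```python
import numpy as np

def cosine_similarity(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def hamming_similarity(x, y):
    """Number of positions where two {-1, +1} codes agree: (x^T y + n) / 2."""
    return int((x @ y + len(x)) // 2)

a = np.array([1, -1, 1, 1, -1, 1, -1, 1])
b = np.array([1, -1, -1, 1, -1, 1, 1, 1])
n = len(a)
```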
Therefore, QSMI is defined as:

$$I_{QS}(X, C) = V_{IN} + V_{ALL} - 2 V_{BTW}, \quad (13)$$

where

$$V_{IN} = \frac{1}{N^2} \sum_{i=1}^{N_C} \sum_{k=1}^{N_i} \sum_{l=1}^{N_i} S(\mathbf{x}_k^{(i)}, \mathbf{x}_l^{(i)}), \quad (14)$$

$$V_{ALL} = \frac{1}{N^2} \left( \sum_{i=1}^{N_C} P(c_i)^2 \right) \sum_{k=1}^{N} \sum_{l=1}^{N} S(\mathbf{x}_k, \mathbf{x}_l), \quad (15)$$

and

$$V_{BTW} = \frac{1}{N^2} \sum_{i=1}^{N_C} P(c_i) \sum_{k=1}^{N_i} \sum_{l=1}^{N} S(\mathbf{x}_k^{(i)}, \mathbf{x}_l). \quad (16)$$
Note that when the information needs are equiprobable, i.e., $P(c_i) = 1/N_C$, then $V_{ALL} = V_{BTW}$ and QSMI can be simplified as $I_{QS}(X, C) = V_{IN} - V_{BTW}$. Therefore, when this assumption holds, QSMI can be easily implemented just by defining the similarity matrix $\mathbf{S} \in \mathbb{R}^{N \times N}$, where $[\mathbf{S}]_{ij} = S(\mathbf{x}_i, \mathbf{x}_j)$ and the notation $[\mathbf{S}]_{ij}$ is used to refer to the $i$-th row and $j$-th column of matrix $\mathbf{S}$. Then, QSMI can be calculated as:

$$I_{QS}(X, C) = \frac{1}{N^2} \left( \mathbf{1}^T (\mathbf{S} \circ \mathbf{\Delta}) \mathbf{1} - \frac{1}{N_C} \mathbf{1}^T \mathbf{S} \mathbf{1} \right), \quad (17)$$

where the indicator matrix $\mathbf{\Delta}$ is defined as:

$$[\mathbf{\Delta}]_{ij} = \begin{cases} 1, & \text{if } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ fulfill the same information need,} \\ 0, & \text{otherwise.} \end{cases} \quad (18)$$
The notation $\mathbf{1}$ is used to refer to an $N$-dimensional vector of ones, while the operator $\circ$ denotes the Hadamard product between two matrices. Please refer to Appendix A for a more detailed derivation. This formulation also allows for directly handling information needs that are not mutually exclusive. In that case, the values of the indicator matrix $\mathbf{\Delta}$ are appropriately set to 1 if two images share at least one information need.
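The matrix form of the simplified QSMI can be sketched in a few lines of numpy (the function name, the unit-interval similarity mapping, and the toy codes are ours, not the paper's implementation): codes aligned with the information needs score a higher QSMI than the same codes assigned to scrambled labels.

```python
import numpy as np

def qsmi_equiprobable(features, labels, n_classes):
    """Simplified QSMI for equiprobable needs: (1/N^2)(1^T (S∘Δ) 1 - (1/N_C) 1^T S 1),
    where S holds pairwise similarities and Δ indicates pairs sharing a need."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    s = 0.5 * (x @ x.T + 1.0)            # cosine mapped to the unit interval
    delta = (labels[:, None] == labels[None, :]).astype(float)
    n = len(features)
    return float(((s * delta).sum() - s.sum() / n_classes) / n ** 2)

# Two well-separated needs: "good" codes match the labels, "bad" codes are scrambled.
good = np.array([[1.0, 1.0], [1.0, 0.9], [-1.0, -1.0], [-1.0, -0.9]])
bad = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, 0.9], [-1.0, -0.9]])
labels = np.array([0, 0, 1, 1])
q_good = qsmi_equiprobable(good, labels, n_classes=2)
q_bad = qsmi_equiprobable(bad, labels, n_classes=2)
```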
Instead of directly optimizing the QSMI, we propose using a "square clamp" around the similarity matrix $\mathbf{S}$, smoothing the optimization surface. Therefore, given that the values of $\mathbf{S}$ range in the unit interval, the loss function is re-derived as:

$$\mathcal{L} = \frac{1}{N^2} \left( \mathbf{1}^T \left( (\mathbf{1}\mathbf{1}^T - \mathbf{S})^{\circ 2} \circ \mathbf{\Delta} \right) \mathbf{1} + \frac{1}{N_C} \, \mathbf{1}^T \left( \mathbf{S}^{\circ 2} \circ (\mathbf{1}\mathbf{1}^T - \mathbf{\Delta}) \right) \mathbf{1} \right), \quad (19)$$

where $(\cdot)^{\circ 2}$ denotes the element-wise square.
As shown in Fig. (a), this formulation penalizes the pairs with larger error more heavily than those with smaller error, allowing for the discovery of more robust solutions. This modification effectively addresses limitation (d), as we also experimentally demonstrate in the ablation study in Section 4.
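The effect of the square clamp on the gradients can be illustrated directly (a sketch in our own notation, for a single similar pair with similarity $s \in [0, 1]$): the linear QSMI objective pushes every pair equally hard, while the clamped penalty $(1 - s)^2$ scales the push with the remaining error, so badly-placed pairs dominate the update.

```python
def linear_grad(s):
    """d/ds of the linear objective (-s): the same push for every pair."""
    return -1.0

def clamped_grad(s):
    """d/ds of the clamped penalty (1 - s)^2: a stronger push for small s."""
    return -2.0 * (1.0 - s)
```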
To allow for scaling to larger datasets, batch-based optimization is used. This reduces the complexity of QMI from $O(N^2)$, which is intractable for larger datasets, to just $O(B^2)$, where $B$ is the batch size, which typically ranges from 64 to 256. However, this implies that each batch will contain images from only a subsample of the available information needs. This in turn means that the observed in-batch prior probability will not match the collection-level prior, leading to underestimating the influence of the potential $V_{BTW}$ on the optimization. To account for this discrepancy, we propose a simple heuristic to estimate the in-batch prior $p$, i.e., the value of $1/N_C$ in (19): $p$ is estimated as $p = (\mathbf{1}^T \mathbf{\Delta} \mathbf{1}) / B^2$, where $B$ is the batch size. To understand the motivation behind this, consider that if the whole collection were used for the optimization, then the number of 1s in $\mathbf{\Delta}$ would be $\sum_{i=1}^{N_C} N_i^2$. Solving this equation for $1/N_C$ yields the value used for approximating $p$. Note that the value of $p$ is not constant and depends on the distribution of the samples in each batch. It was experimentally verified that this approach indeed improves the performance of the proposed method over using a constant value for $p$.
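The in-batch prior heuristic reduces to the fraction of 1s in the batch indicator matrix (the helper name and the toy batch are ours): for a balanced batch drawn from $N_C$ classes, it recovers exactly $1/N_C$.

```python
import numpy as np

def inbatch_prior(labels):
    """Estimate p = 1/N_C from the batch itself as the fraction of 1s in the
    indicator matrix Δ, i.e. p = (1^T Δ 1) / B^2 for batch size B."""
    delta = (labels[:, None] == labels[None, :]).astype(float)
    return float(delta.sum() / len(labels) ** 2)

# A balanced toy batch: three classes, two samples each, so p should be 1/3.
balanced = np.array([0, 0, 1, 1, 2, 2])
```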
3.4 Deep Supervised Hashing using QSMI
The proposed QSMI is used to train a deep neural network to extract short binary hash codes, as shown in Fig. 1. Let $\mathbf{x}$ be the raw representation of an image (e.g., its pixels) and let $\mathbf{y} = f_{\mathbf{W}}(\mathbf{x}) \in \mathbb{R}^m$ be the output of a neural network $f_{\mathbf{W}}$, where $\mathbf{W}$ denotes the matrix of the parameters of the network and $m$ is the length of the hash code. Apart from learning a representation that minimizes the loss, the network must generate an output that can be easily translated into a binary hash code. Several techniques have been proposed to this end, e.g., using the tanh function [38]. In this work, the output of the network is required to be close to one of two possible values, either $-1$ or $1$. Therefore, the used hashing regularizer is defined, following recent deep supervised hashing approaches [21], as:

$$\mathcal{L}_r = \left\| \, |\mathbf{y}| - \mathbf{1} \, \right\|_1, \quad (20)$$

where $|\cdot|$ denotes the (element-wise) absolute value and $\|\cdot\|_1$ denotes the $l_1$ norm. The final loss function is defined as:

$$\mathcal{J} = \mathcal{L} + \alpha \mathcal{L}_r, \quad (21)$$

where $\alpha$ is the weight of the hashing regularizer. The network can then be trained using gradient descent, i.e., $\Delta \mathbf{W} = -\eta \, \partial \mathcal{J} / \partial \mathbf{W}$, where $\eta$ is the learning rate. Please refer to Appendix A for details regarding the derivation of the gradients. After training the network, the hash codes can be readily obtained using the $\operatorname{sign}$ function.
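The regularizer in (20) and the final binarization step can be sketched as follows (helper names and the example outputs are ours): outputs already close to $\pm 1$ incur almost no penalty, while uncommitted outputs near zero are penalized heavily, and the final code is read off with the sign function.

```python
import numpy as np

def hashing_regularizer(y):
    """L1 penalty || |y| - 1 ||_1 pushing each network output toward -1 or +1."""
    return float(np.abs(np.abs(y) - 1.0).sum())

def to_hash_code(y):
    """Binary hash code obtained after training via the sign function."""
    return np.sign(y).astype(int)

y_binary_like = np.array([0.99, -1.01, 1.0, -0.98])   # nearly binary: tiny penalty
y_uncommitted = np.array([0.10, -0.20, 0.05, -0.30])  # near zero: large penalty
```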
Even though learning highly discriminative hash codes is desirable for retrieving data that belong to the training domain, it can negatively affect retrieval for previously unseen information needs [34]. The proposed method can be easily modified to also optimize the hash codes toward other, unsupervised information needs. Even though any source of information can be used, in this work the information needs are discovered by clustering the training data. This allows for discovering information needs dictated by the structure of the data. Let $\mathcal{L}_u$ denote the loss induced by applying the QSMI loss function to these information needs. Then, the final loss function for this semi-supervised variant is defined as the weighted sum of the supervised and unsupervised losses. This variant is used for the experiments conducted in Section 4.4.
4 Experimental Evaluation
The proposed method is extensively evaluated in this Section, through both an ablation study and comparisons with other state-of-the-art methods. First, the datasets and the employed evaluation setup are briefly described. The hyperparameters and the network architectures used for the evaluation are provided in Appendix B. Finally, an ablation study is provided and the proposed method is evaluated using three different datasets.
4.1 Datasets and Evaluation Metrics
Three image datasets are used to evaluate the proposed method: the Fashion MNIST dataset, the CIFAR-10 dataset, and the NUS-WIDE dataset. All images were preprocessed to have zero mean and unit variance, according to the statistics of the dataset used for training.
The Fashion MNIST dataset is composed of 60,000 training images and 10,000 test images [41]. The size of each image is 28×28 pixels (grayscale images) and there is a total of 10 different classes (each one expresses a different information need). The whole training set was used to train the networks and build the database, while the test set was used to query the database and evaluate the performance of the methods.
The CIFAR-10 dataset is composed of 50,000 training images and 10,000 test images [14]. The size of each image is 32×32 pixels (color images) and there is a total of 10 different classes (information needs). Again, the whole training set was used to train the networks and build the database, while the test set was used to query the database and evaluate the performance of the methods.
NUS-WIDE is a large-scale dataset that contains 269,648 images that belong to 81 different concepts [4]. The images were resized to a fixed resolution before being fed to the network. Following [19], only images that belong to the 21 most frequent concepts, i.e., 195,834 images, were used for training/evaluating the methods. Each image might belong to multiple different concepts, i.e., the information needs are not mutually exclusive. For evaluating the methods, two images were considered relevant if they share at least one common concept, which is the standard protocol used for this dataset [19]. Similarly to the other two datasets, the whole training set (193,734 randomly sampled images) was used to train the networks and build the database, while 2,100 randomly sampled queries (100 from each category) were employed to evaluate the methods.
To evaluate the proposed methods, the following four metrics were used: precision, recall, mean average precision (mAP), and precision within Hamming radius of 2. Nearest-neighbor search using the Hamming distance was used to retrieve the relevant documents [24]. Following [24], precision is defined as the ratio of the number of retrieved objects that fulfill the same information need as the query to the total number $k$ of retrieved objects, while recall is defined as the ratio of the number of retrieved relevant objects to the total number of database objects that fulfill the same information need as the query. The average precision (AP) is calculated at eleven equally spaced recall points (0, 0.1, …, 0.9, 1) and the mean average precision is calculated as the mean of the APs over all queries. The precision within Hamming radius of 2 is defined as the ratio of the number of relevant documents within Hamming distance 2 from the query to the total number of documents within Hamming distance 2 from the query.
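The precision and recall metrics above can be sketched for a single ranked result list (the function name and the toy relevance flags are ours): precision@k divides the hits by the cut-off $k$, recall@k divides them by the total number of relevant items in the database.

```python
def precision_recall_at_k(relevance, total_relevant, k):
    """precision@k = (#relevant in top-k)/k; recall@k = (#relevant in top-k)/total_relevant.
    `relevance` is the ranked list of 0/1 relevance flags for one query."""
    hits = sum(relevance[:k])
    return hits / k, hits / total_relevant

# Ranked retrieval result for a single query (1 = fulfills the same need as the query).
ranked = [1, 1, 0, 1, 0]
p5, r5 = precision_recall_at_k(ranked, total_relevant=3, k=5)
```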
4.2 Ablation Study
Clamped  Spherical  mAP  precision () 

No  No  
Yes  No  
Yes  Yes 
A smaller dataset, the Fashion MNIST dataset [41], was used to perform an ablation study. The effect of various design choices, i.e., using or not the proposed clamped loss and the spherical formulation, is evaluated in Table 1. The mean Average Precision (mAP) is averaged over 5 runs, while the code length was set to 48 bits for these experiments. Several conclusions can be drawn from the results reported in Table 1. First, employing the proposed clamped loss, instead of directly optimizing the QMI, improves the hashing precision, confirming our hypothesis regarding the benefits of the proposed clamped loss (as also described in the previous Section and shown in Fig. (a)). This is also confirmed in the learning curve shown in Fig. (b), where both the proposed clamped loss and the MI are monitored during the optimization. Optimizing the proposed clamped loss is directly correlated with the QMI and the proposed QSMI, both of which steadily increase during the 50 training epochs. When the spherical formulation is used (QSMI method), the mAP further increases to 86.1% from 72.7% (standard QMI formulation). The effect of the regularization parameter is also examined in Fig. (c).

The proposed method was compared to two other state-of-the-art techniques, the Deep Supervised Hashing (DSH) method [21] and the Deep Pairwise Supervised Hashing (DPSH) method [19]. We carefully implemented these methods in a batch-based setting and tuned their hyperparameters to obtain the best performance (please refer to Appendix B). The evaluation results are shown in Table 2. The proposed method, abbreviated as "QSMIH", significantly outperforms the other two state-of-the-art pairwise hashing techniques, highlighting the importance of using theoretically sound objectives for learning deep supervised hash codes. Recall that a deep CNN, trained from scratch, was employed for these experiments. The precision within Hamming radius of 2 is also shown in Fig. (a). Again, the proposed method outperforms all the other methods for all the evaluated hash code lengths.
Method  12 bits  24 bits  36 bits  48 bits 

DSH  
DPSH  
QSMIH 
4.3 Supervised Hashing Evaluation
The proposed method was also evaluated using the CIFAR-10 [14] and NUS-WIDE [4] datasets. For CIFAR-10, a DenseNet [9] pretrained on the same dataset was employed, while for the NUS-WIDE dataset, a DenseNet pretrained on the ImageNet dataset was used [33].
Method  8 bits  12 bits  24 bits  36 bits  48 bits

MIHash*  N/A  0.929  0.933  0.938***  0.942 
DSH**  
DPSH**  
QSMIH 
*Results as reported in [2] (using a slightly different setup), **Results using our implementation of DSH [21] and DPSH [19], ***Results for 32 bits (as reported in [2]).
The evaluation results for the CIFAR-10 dataset are reported in Table 3. The proposed method is also compared to the MIHash method [2]. The proposed method outperforms all the other techniques by a large margin for small code lengths, i.e., 8 and 12 bits. For larger hash codes, the proposed method performs equally well with the DSH and DPSH methods. However, the proposed method is capable of achieving almost the same performance as the DSH and DPSH methods using less than half the bits, highlighting the expressive power of the proposed technique. Also, to the best of our knowledge, this is the best result reported in the literature for the CIFAR-10 dataset (regardless of the employed hashing technique). The precision within Hamming radius of 2 is shown in Fig. (b). Again, the proposed method performs significantly better for smaller code lengths, i.e., 8 and 12 bits, while matching the precision of the DPSH method for hash codes larger than 24 bits.
The proposed QSMIH method was also evaluated using the larger-scale NUS-WIDE dataset, which contains 269,648 images that belong to 81 different concepts. Following [19], we used the images that belong to the 21 most frequent concepts, i.e., 195,834 images. Note that an image might belong to more than one concept, i.e., fulfill multiple information needs. Furthermore, instead of using a subsample of the training set, we used the whole training set (193,734 randomly sampled images) to learn the hash codes (all the methods were used in a batch setting), and a test set of 2,100 randomly sampled queries was employed to evaluate the methods. Since there are many differences in the evaluation protocols used by different papers for this dataset, we compared the proposed method to the DSH and DPSH methods using the same network and evaluation setup. The evaluation results are shown in Table 4. Again, the proposed method outperforms the rest of the evaluated methods for any code length. The same behavior is also observed in the precision results illustrated in Fig. (c). Note that a similar behavior for the DSH method, i.e., the precision decreasing as the code length increases beyond a certain point, is also reported by the authors of the DSH method [21].
Method  8 bits  12 bits  24 bits  36 bits  48 bits 

DSH 

DPSH 

QSMIH 


4.4 Retrieval of Unseen Information Needs
Finally, the proposed method was evaluated using the evaluation setup proposed in [34], i.e., 75% of the classes were used to train the models and the remaining 25% were used to evaluate them. The process was repeated 5 times using different class/information-need splits and the mean and standard deviation are reported. The evaluation results are shown in Table 5. The proposed method also employed 5 unsupervised information needs (discovered by running the k-means algorithm on the 75% of the training data). The weight of the unsupervised loss was tuned as described in Appendix B. The proposed variant, denoted by "QSMIH+U", leads to significantly more regularized representations that do not collapse outside the training domain, increasing the mAP for unseen classes from 0.689 to 0.795, demonstrating the flexibility of the proposed approach as well as its effectiveness in this setup.

Method  12 bits  24 bits  36 bits

DSH  
DPSH  
QSMIH+U  

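The unsupervised information needs mentioned above can be discovered by clustering the training representations with k-means; the following sketch (a plain Lloyd's k-means in NumPy, with illustrative function and variable names) shows how such pseudo-labels can be obtained:

```python
import numpy as np

def kmeans_pseudo_labels(x, n_clusters=5, n_iters=50, seed=0):
    """Assign each training sample to one of `n_clusters` unsupervised
    "information needs" via Lloyd's k-means (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # initialize the centers from randomly chosen samples
    centers = x[rng.choice(len(x), n_clusters, replace=False)]
    for _ in range(n_iters):
        # assign each sample to its nearest center
        d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute the centers (empty clusters keep their old center)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean(axis=0)
    return labels
```

The resulting cluster assignments can then be treated as additional (unsupervised) information needs and combined with the supervised loss during optimization.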
5 Conclusions
A deep supervised hashing algorithm, adapted to the needs of large-scale hashing, that optimizes the learned codes using a novel information-theoretic measure, the Quadratic Spherical Mutual Information, was proposed. The proposed method was evaluated using three different datasets and evaluation setups and compared to other state-of-the-art supervised hashing techniques. The proposed method outperformed all the other evaluated methods regardless of the size of the used dataset and training setup, exhibiting significantly more stable behavior than the rest of the evaluated methods. More specifically, when used with a randomly initialized network, the proposed QSMI-H method outperformed the rest of the methods by a large margin. When combined with powerful pretrained networks, it again yielded the best results regardless of the length of the used hash code. The proposed method also provides theoretical justification for several existing deep supervised hashing techniques, while paving the way for developing more advanced representation learning techniques for information retrieval using the proposed information-theoretic formulation, e.g., handling cross-modal retrieval tasks [43].
Appendix A  Implementation Details
To simplify the implementation of the proposed method, we assume that all the information needs are equiprobable, i.e., $P(c) = 1/N_C$. Then, the information potentials $V_{ALL}$ and $V_{BTW}$ can be calculated as:

(22) $V_{ALL} = \left( \sum_{c=1}^{N_C} P(c)^2 \right) \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} K(\mathbf{y}_i, \mathbf{y}_j) = \frac{1}{N_C N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} K(\mathbf{y}_i, \mathbf{y}_j)$

and

(23) $V_{BTW} = \frac{1}{N^2} \sum_{c=1}^{N_C} P(c) \sum_{i: c_i = c} \sum_{j=1}^{N} K(\mathbf{y}_i, \mathbf{y}_j) = \frac{1}{N_C N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} K(\mathbf{y}_i, \mathbf{y}_j),$

where we assumed that $V_{ALL} = V_{BTW}$, since $P(c) = 1/N_C$. Also, the information potential $V_{IN}$ can be expressed using the indicator matrix $\mathbf{\Delta}$, where $[\mathbf{\Delta}]_{ij} = 1$ if the $i$-th and $j$-th documents fulfill the same information need and $0$ otherwise:

(24) $V_{IN} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} [\mathbf{\Delta}]_{ij} K(\mathbf{y}_i, \mathbf{y}_j)$

Therefore, the QSMI can be simplified as:

(25) $QSMI = V_{IN} + V_{ALL} - 2 V_{BTW} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( [\mathbf{\Delta}]_{ij} - \frac{1}{N_C} \right) K(\mathbf{y}_i, \mathbf{y}_j)$

Finally, using the proposed clamping method, i.e., replacing $K$ by the clamped similarity $\tilde{K}$, the final loss function is obtained as:

(26) $\mathcal{L} = - \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( [\mathbf{\Delta}]_{ij} - \frac{1}{N_C} \right) \tilde{K}(\mathbf{y}_i, \mathbf{y}_j),$

since we aim to maximize (25). Also, note that (26) can be equivalently expressed as:

(27) $\mathcal{L} = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left( \frac{1}{N_C} - [\mathbf{\Delta}]_{ij} \right) \tilde{K}(\mathbf{y}_i, \mathbf{y}_j),$

allowing for efficiently implementing the proposed method.
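For example, assuming equiprobable information needs and omitting the clamping step for brevity, (27) admits a fully vectorized implementation, as in the following illustrative NumPy sketch (with our own naming):

```python
import numpy as np

def qsmi_loss(y, labels, n_classes):
    """Loss of Eq. (27): (1/N^2) * sum_ij (1/N_C - Delta_ij) * K(y_i, y_j),
    with K the pairwise cosine similarity (clamping omitted for brevity)."""
    n = len(y)
    y_norm = y / np.linalg.norm(y, axis=1, keepdims=True)
    k = y_norm @ y_norm.T                                        # pairwise cosine similarities
    delta = (labels[:, None] == labels[None, :]).astype(float)   # indicator matrix Delta
    return float(np.sum((1.0 / n_classes - delta) * k) / n ** 2)
```

Minimizing this quantity pushes the cosine similarity of codes that fulfill the same information need up and the similarity of codes belonging to different needs down, as prescribed by (25).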
To implement the gradient descent algorithm, the derivative $\partial \mathcal{L} / \partial \mathbf{W}$, where $\mathbf{W}$ denotes the parameters of the employed neural network, must be calculated. This derivative is calculated as:

(28) $\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \sum_{i=1}^{N} \frac{\partial \mathcal{L}}{\partial \mathbf{y}_i} \frac{\partial \mathbf{y}_i}{\partial \mathbf{W}},$

where $\mathbf{y}_i$ is the output of the used neural network for the $i$-th document. The derivative $\partial \mathbf{y}_i / \partial \mathbf{W}$ depends on the employed architecture, while the derivative of the proposed loss with respect to the hash code can be calculated as:

(29) $\frac{\partial \mathcal{L}}{\partial \mathbf{y}_i} = - \left( \frac{\partial V_{IN}}{\partial \mathbf{y}_i} - \frac{\partial V_{ALL}}{\partial \mathbf{y}_i} \right),$

where

(30) $\frac{\partial V_{IN}}{\partial \mathbf{y}_i} = \frac{2}{N^2} \sum_{j=1}^{N} [\mathbf{\Delta}]_{ij} \frac{\partial K(\mathbf{y}_i, \mathbf{y}_j)}{\partial \mathbf{y}_i}$

and

(31) $\frac{\partial V_{ALL}}{\partial \mathbf{y}_i} = \frac{2}{N_C N^2} \sum_{j=1}^{N} \frac{\partial K(\mathbf{y}_i, \mathbf{y}_j)}{\partial \mathbf{y}_i}.$

Finally, the derivative of the cosine similarity, needed for calculating (30) and (31), can be computed as:

(32) $\frac{\partial \cos(\mathbf{y}_i, \mathbf{y}_j)}{\partial \mathbf{y}_i} = \frac{\mathbf{y}_j}{\|\mathbf{y}_i\| \|\mathbf{y}_j\|} - \cos(\mathbf{y}_i, \mathbf{y}_j) \frac{\mathbf{y}_i}{\|\mathbf{y}_i\|^2}.$

The loss and the corresponding derivatives can be similarly calculated when the information needs are not equiprobable.
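The analytic derivative of the cosine similarity in (32) can be verified against a finite-difference approximation, as in the following illustrative sketch:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cos_grad(a, b):
    """Analytic derivative of cos(a, b) with respect to a, as in Eq. (32)."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return b / (na * nb) - cos_sim(a, b) * a / na ** 2

def numeric_grad(a, b, eps=1e-6):
    """Central finite-difference approximation of the same derivative."""
    g = np.zeros_like(a)
    for i in range(len(a)):
        e = np.zeros_like(a)
        e[i] = eps
        g[i] = (cos_sim(a + e, b) - cos_sim(a - e, b)) / (2 * eps)
    return g
```

Such a gradient check is a cheap safeguard when implementing (28)-(32) manually rather than relying on automatic differentiation.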
Appendix B  Hyperparameters and Network Architectures
The selected hyperparameters are shown in Table 6. We selected the best parameters for the two other evaluated methods, i.e., DSH and DPSH, by performing line search over each parameter. The Adam optimizer [13], with the default hyperparameters, was used for the optimization. The experiments were repeated 5 times and the mean value of each of the evaluated metrics is reported, unless otherwise stated.
Table 6: Selected hyperparameters.

Parameter     | Method | F. MNIST | CIFAR-10 | NUS-WIDE
Learning rate | all    | 0.001    | 0.001    | 0.001
Batch size    | all    | 128      | 128      | 128
Epochs        | all    | 50       | 5        | 50
              | DSH    |          |          |
              | QSMI-H |          |          |
[19]          | DPSH   | 5 (3 for 36 and 48 bits) | |
For the experiments conducted on the Fashion MNIST dataset, a relatively simple Convolutional Neural Network (CNN) architecture was employed, as shown in Table 7. The network was initialized using the default PyTorch initialization scheme [31], and it was trained from scratch for all the conducted experiments.
For the CIFAR-10 dataset, a DenseNet-BC-190 (growth rate 40 and compression rate 2) [9], pretrained on the CIFAR dataset, was used. For the NUS-WIDE dataset, a DenseNet-201 (growth rate 32 and compression rate 2), pretrained on the ImageNet dataset [33], was employed. The feature representation was extracted from the last average pooling layer of each network. Then, two fully connected layers were used: one with rectifier activation functions [25], and one with as many neurons as the desired code length (no activation function was used for the output layer). The size of the hidden layer was set to for the CIFAR-10 dataset and to for the NUS-WIDE dataset. To speed up the training process, we backpropagated the gradients only to the last two layers of the network, which were trained to perform supervised hashing.
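The hashing head described above, i.e., a fully connected layer with rectifier activations followed by a linear output layer with one neuron per bit, thresholded to obtain the binary code, can be sketched as follows (a NumPy illustration with made-up layer sizes; the actual models are implemented in PyTorch on top of the pretrained DenseNet backbones):

```python
import numpy as np

def hashing_head(features, w1, b1, w2, b2):
    """Map backbone features to binary hash codes:
    dense + ReLU, then a linear layer with one output per bit,
    thresholded at zero to produce the code."""
    h = np.maximum(features @ w1 + b1, 0.0)  # hidden fully connected layer, ReLU [25]
    y = h @ w2 + b2                          # real-valued code (no activation on the output)
    return (y > 0).astype(np.uint8)          # binarize to obtain the hash code

# illustrative sizes: 64-d features, 32 hidden neurons, 12-bit codes
rng = np.random.default_rng(0)
w1, b1 = rng.standard_normal((64, 32)), np.zeros(32)
w2, b2 = rng.standard_normal((32, 12)), np.zeros(12)
codes = hashing_head(rng.standard_normal((5, 64)), w1, b1, w2, b2)
```

During training the real-valued outputs `y` are used to compute the loss, while the thresholded codes are only used at retrieval time.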
Table 7: CNN architecture used for the Fashion MNIST experiments.

Layer       | Kernel Size | Filters / Neurons | Activation
Convolution |             | 32                | ReLU [25]
Max Pooling |             |                   |
Convolution |             | 64                | ReLU [25]
Max Pooling |             |                   |
Dense       |             | # bits            |
References
[1] Mohannad Almasri, Catherine Berrut, and Jean-Pierre Chevallet. A comparison of deep learning based query expansion with pseudo-relevance feedback and mutual information. In Proceedings of the European Conference on Information Retrieval, pages 709–715, 2016.

[2] Fatih Cakir, Kun He, Sarah Adel Bargal, and Stan Sclaroff. MIHash: Online hashing with mutual information. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
[3] Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Philip S. Yu. HashNet: Deep learning to hash by continuation. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
[4] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM Conference on Image and Video Retrieval, 2009.
[5] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Annual Symposium on Computational Geometry, pages 253–262, 2004.
[6] Cheng Deng, Zhaojia Chen, Xianglong Liu, Xinbo Gao, and Dacheng Tao. Triplet-based deep hashing network for cross-modal retrieval. IEEE Transactions on Image Processing, 27(8):3893–3903, 2018.
[7] Guiguang Ding, Jile Zhou, Yuchen Guo, Zijia Lin, Sicheng Zhao, and Jungong Han. Large-scale image retrieval with sparse embedded hashing. Neurocomputing, 257:24–36, 2017.

[8] Jiani Hu, Weihong Deng, and Jun Guo. Improving retrieval performance by global analysis. In Proceedings of the International Conference on Pattern Recognition, volume 2, pages 703–706, 2006.
[9] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

[10] Qing-Yuan Jiang and Wu-Jun Li. Asymmetric deep supervised hashing. In Proceedings of the International Joint Conference on Artificial Intelligence, 2018.
[11] Wang-Cheng Kang, Wu-Jun Li, and Zhi-Hua Zhou. Column sampling based discrete supervised hashing. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1230–1236, 2016.
 [12] Vandana D Kaushik, J Umarani, Amit K Gupta, Aman K Gupta, and Phalguni Gupta. An efficient indexing scheme for face database using modified geometric hashing. Neurocomputing, 116:208–221, 2013.
 [13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
 [14] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical Report, 2009.
 [15] Hanjiang Lai, Yan Pan, Ye Liu, and Shuicheng Yan. Simultaneous feature learning and hash coding with deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[16] Michael S Lew, Nicu Sebe, Chabane Djeraba, and Ramesh Jain. Content-based multimedia information retrieval: State of the art and challenges. ACM Transactions on Multimedia Computing, Communications, and Applications, 2(1):1–19, 2006.
 [17] Peng Li, Jian Cheng, and Hanqing Lu. Hashing with dual complementary projection learning for fast image retrieval. Neurocomputing, 120:83–89, 2013.
 [18] Qi Li, Zhenan Sun, Ran He, and Tieniu Tan. Deep supervised discrete hashing. In Proceedings of the Advances in Neural Information Processing Systems, pages 2479–2488, 2017.
[19] Wu-Jun Li, Sheng Wang, and Wang-Cheng Kang. Feature learning based deep supervised hashing with pairwise labels. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1711–1717, 2016.

[20] Guosheng Lin, Chunhua Shen, Qinfeng Shi, Anton Van den Hengel, and David Suter. Fast supervised hashing with decision trees for high-dimensional data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1971–1978, 2014.
[21] Haomiao Liu, Ruiping Wang, Shiguang Shan, and Xilin Chen. Deep supervised hashing for fast image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2064–2072, 2016.
[22] Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Supervised hashing with kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2074–2081, 2012.
[23] Lei Ma, Hongliang Li, Fanman Meng, Qingbo Wu, and King Ngi Ngan. Global and local semantics-preserving based deep hashing for cross-modal retrieval. Neurocomputing, 2018.
 [24] Christopher D Manning, Prabhakar Raghavan, Hinrich Schütze, et al. Introduction to information retrieval, volume 1. Cambridge University Press, 2008.
[25] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the International Conference on Machine Learning, pages 807–814, 2010.
 [26] Shubham Pachori, Ameya Deshpande, and Shanmuganathan Raman. Hashing in the zero shot framework with domain adaptation. Neurocomputing, 275:2137–2149, 2018.

[27] Emanuel Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.
[28] Nikolaos Passalis and Anastasios Tefas. Entropy optimized feature-based bag-of-words representation for information retrieval. IEEE Transactions on Knowledge & Data Engineering, (7):1664–1677, 2016.
[29] Nikolaos Passalis and Anastasios Tefas. Learning neural bag-of-features for large-scale image retrieval. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(10):2641–2652, 2017.
[30] Nikolaos Passalis and Anastasios Tefas. Learning bag-of-embedded-words representations for textual information retrieval. Pattern Recognition, 81:254–267, 2018.
[31] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
 [32] Jose C Principe. Information theoretic learning: Renyi’s entropy and kernel perspectives. Springer Science & Business Media, 2010.
[33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [34] Alexandre Sablayrolles, Matthijs Douze, Nicolas Usunier, and Hervé Jégou. How should we evaluate supervised hashing? In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1732–1736, 2017.
 [35] Fumin Shen, Xin Gao, Li Liu, Yang Yang, and Heng Tao Shen. Deep asymmetric pairwise hashing. In Proceedings of the ACM on Multimedia Conference, pages 1522–1530, 2017.
 [36] Kari Torkkola. Feature extraction by nonparametric mutual information maximization. Journal of Machine Learning Research, 3:1415–1438, 2003.
[37] Di Wang, Xinbo Gao, and Xiumei Wang. Semi-supervised constraints preserving hashing. Neurocomputing, 167:230–242, 2015.
 [38] Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):769–790, 2018.
 [39] Xiaofang Wang, Yi Shi, and Kris M Kitani. Deep supervised hashing with triplet labels. In Proceedings of the Asian Conference on Computer Vision, pages 70–84, 2016.
 [40] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised hashing for image retrieval via image representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2156–2162, 2014.
[41] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[42] Xing Xu, Li He, Atsushi Shimada, Rin-ichiro Taniguchi, and Huimin Lu. Learning unified binary codes for cross-modal retrieval via latent semantic hashing. Neurocomputing, 213:191–203, 2016.
[43] Xing Xu, Fumin Shen, Yang Yang, Heng Tao Shen, and Xuelong Li. Learning discriminative binary codes for large-scale cross-modal retrieval. IEEE Transactions on Image Processing, 26(5):2494–2507, 2017.
[44] Chengwei Yao, Jiajun Bu, Chenxia Wu, and Gencai Chen. Semi-supervised spectral hashing for fast similarity search. Neurocomputing, 101:52–58, 2013.
[45] Tao Yao, Xiangwei Kong, Haiyan Fu, and Qi Tian. Semantic consistency hashing for cross-modal retrieval. Neurocomputing, 193:250–259, 2016.
[46] Ruimao Zhang, Liang Lin, Rui Zhang, Wangmeng Zuo, and Lei Zhang. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Transactions on Image Processing, 24(12):4766–4779, 2015.
 [47] Fang Zhao, Yongzhen Huang, Liang Wang, and Tieniu Tan. Deep semantic ranking based hashing for multilabel image retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1556–1564. IEEE, 2015.
[48] Wanqing Zhao, Hangzai Luo, Jinye Peng, and Jianping Fan. Spatial pyramid deep hashing for large-scale image retrieval. Neurocomputing, 243:166–173, 2017.
[49] Liang Zheng, Shengjin Wang, and Qi Tian. Coupled binary embedding for large-scale image retrieval. IEEE Transactions on Image Processing, 23(8):3368–3380, 2014.
 [50] Han Zhu, Mingsheng Long, Jianmin Wang, and Yue Cao. Deep hashing network for efficient similarity retrieval. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 2415–2421, 2016.