I. Introduction
For decades, image hashing has been an active research field in the vision community [1, 2, 3, 4] due to its advantages in storage and computation speed for similarity search/retrieval under specific conditions [2]. Firstly, the binary code should be short so that the whole hash table can fit in memory. Secondly, the binary code should preserve similarity, i.e., (dis)similar images have (dis)similar hash codes in the Hamming distance space. Finally, the algorithm for learning the parameters should be fast, and for unseen samples, the hashing method should produce hash codes efficiently. It is very challenging to satisfy all three requirements simultaneously, especially under the binary constraint, which leads to an NP-hard mixed-integer optimization problem. In this paper, we aim to tackle all these challenging conditions and constraints.
The hashing methods proposed in the literature can be categorized into data-independent [5, 6, 7] and data-dependent methods; the latter has recently received more attention in both (semi-)supervised [8, 9, 10, 11, 12] and unsupervised [13, 14, 2, 15, 16, 17, 18, 19, 20, 21] settings. However, in practice, labeled datasets are limited and costly; hence, in this work, we focus only on the unsupervised setting. We refer readers to recent surveys [22, 23, 24, 25] for more detailed reviews of data-independent/dependent hashing methods.
I-A. Related Works
The work most relevant to our proposal is Iterative Quantization (ITQ) [2], which is a very fast and competitive hashing method. ITQ rests on two ideas. Firstly, to achieve low-dimensional features, it uses the well-known Principal Component Analysis (PCA) method. PCA maximizes the variance of the projected data and keeps the dimensions pairwise uncorrelated. Hence, the low-dimensional data, projected using the top PCA component vectors, can preserve data similarity well. Secondly, minimizing the binary quantization loss using an orthogonal rotation matrix strictly maintains the pairwise distances of the data. As a result, ITQ learns binary codes that highly preserve the local structure of the data. However, optimizing these two steps separately, especially when no binary constraint is enforced in the first step, i.e., PCA, leads to suboptimal solutions. In contrast, we propose to jointly optimize the projection variation and the quantization loss.
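As a concrete reference point, the two-stage ITQ pipeline described above (PCA, then an alternating rotation/binarization refinement via the Orthogonal Procrustes solution) can be sketched in NumPy. This is an illustrative sketch under our own naming, not the authors' implementation:

```python
import numpy as np

def itq_sketch(X, L, n_iter=50, seed=0):
    """Two-stage ITQ-style baseline: (1) PCA projection to L dims,
    (2) learn a rotation R minimizing the quantization loss ||B - VR||_F^2."""
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)                       # zero-center the data
    # Stage 1: PCA (no binary constraint involved at this step).
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:L].T                                 # D x L top components
    V = X @ P                                    # n x L projected data
    # Stage 2: alternate B = sign(VR) and the Procrustes-optimal rotation.
    R = np.linalg.qr(rng.standard_normal((L, L)))[0]
    for _ in range(n_iter):
        B = np.sign(V @ R)
        U, _, Wt = np.linalg.svd(V.T @ B)        # argmax_R tr(B^T V R) = U @ Wt
        R = U @ Wt
    return np.sign(V @ R), P @ R                 # codes and overall projection

X = np.random.default_rng(1).standard_normal((200, 16))
B, W = itq_sketch(X, L=8)
```

Note how the binary constraint only enters in stage 2; the PCA step is oblivious to it, which is exactly the suboptimality the paper targets.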
Other works that are highly relevant to our proposed method are Binary Autoencoder (BA) [13] and UH-BDNN [14]. In these methods, the authors propose to combine data dimension reduction and binary quantization into a single step by using the encoder of an autoencoder, while the decoder encourages (dis)similar inputs to map to (dis)similar binary codes. However, the reconstruction criterion is not a direct way of preserving similarity [14]. Additionally, although it achieves very competitive performance, UH-BDNN is based on a deep neural network (DNN); hence, it is difficult to produce the binary codes in a computationally efficient manner. Other recent deep hashing methods leverage the powerful capability of Convolutional Neural Networks (CNN) to jointly learn the image representations and binary codes. However, due to the non-smooth nature of the binary constraint, which causes ill-defined gradients in back-propagation, these methods resort to relaxation or approximation. As a result, even though they achieve highly discriminative image representations, these methods can only produce suboptimal binary codes. In this paper, we show that by directly handling the binary constraint, our methods can obtain much better binary codes and, hence, achieve higher retrieval performance.
I-B. Contributions
In this work, to address the problem of learning to preserve data affinity in low-dimensional binary codes, (i) we first propose a novel loss function to learn a single linear transformation under the column orthonormal constraint (please refer to Section I-C for our term definitions) in an unsupervised manner that compresses and binarizes the input data jointly. The approach is named Simultaneous Compression and Quantization (SCQ). Note that the idea of jointly compressing and binarizing data has been explored in [13, 14]. However, due to the difficulty of the non-convex orthogonal constraint, these works relax the orthogonal constraint and resort to the reconstruction criterion as an indirect way of handling the similarity-preserving concern. Our work is the first to tackle the similarity concern by enforcing strict orthogonal constraints. (ii) Under the strict orthogonal constraints, we conduct analysis and experiments to show that our formulation is able to retain a high amount of the variation and achieve small quantization loss, which are important requirements in hashing for image retrieval [2, 13, 14]. As a result, this leads to improved accuracy, as demonstrated in our experiments.
(iii) We then propose to relax the column orthonormal constraint to a column orthogonal constraint on the transformation matrix. The relaxation not only further improves retrieval performance but also significantly reduces the training time.
(iv) Our proposed loss functions, with column orthonormal and orthogonal constraints, pose two main challenges. The first is the binary constraint, which is the traditional and well-known difficulty of the hashing problem [1, 2, 3]. The second is the non-convex nature of the orthonormal/orthogonal constraint [28]. To tackle the binary constraint, we propose to apply alternating optimization with an auxiliary variable. Additionally, we resolve the orthonormal/orthogonal constraint by using the cyclic coordinate descent approach to learn one column of the projection matrix at a time while fixing the others. The proposed algorithms are named Orthonormal Encoder (OnE) and Orthogonal Encoder (OgE).
(v) Comprehensive experiments on common benchmark datasets show considerable improvements in the retrieval performance of the proposed methods over other state-of-the-art hashing methods. Additionally, the computational complexity and the training/online-processing time are also discussed to show the computational efficiency of our methods.
I-C. Notations and Term Definitions
We first introduce the notations. Given a zero-centered dataset $X \in \mathbb{R}^{n \times D}$ which consists of $n$ images, each represented by a $D$-dimensional feature descriptor, our proposed hashing methods aim to learn a column orthonormal/orthogonal matrix $W \in \mathbb{R}^{D \times L}$ which simultaneously compresses the input data $X$ to the $L$-dimensional space, while retaining a high amount of variation, and quantizes the result to binary codes $B \in \{-1, +1\}^{n \times L}$. It is important to note that, in this work, we abuse the terms column orthonormal/orthogonal matrix. Specifically, the term column orthonormal matrix indicates a matrix $W$ such that $W^\top W = I_L$, where $I_L$ is the $L \times L$ identity matrix, while the term column orthogonal matrix indicates a matrix $W$ such that $W^\top W = \Lambda$, where $\Lambda$ is an arbitrary diagonal matrix. Note that the word "column" means that the columns of the matrix are pairwise independent.
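The two definitions can be checked mechanically; in the sketch below (names ours), `Q` is column orthonormal while `Wg` is only column orthogonal:

```python
import numpy as np

rng = np.random.default_rng(0)
# Column orthonormal: W^T W = I_L (pairwise-orthogonal, unit-norm columns).
Q = np.linalg.qr(rng.standard_normal((6, 3)))[0]
# Column orthogonal: W^T W is an arbitrary diagonal matrix
# (pairwise-orthogonal columns with arbitrary norms).
Wg = Q * np.array([0.5, 2.0, 1.3])   # rescaling columns keeps them orthogonal
```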
We define $\{a_i\}_{i=1}^{D}$ as the eigenvalues of the covariance matrix $X^\top X$, sorted in descending order. Finally, let $w_k$ and $b_k$ be the $k$-th columns of $W$ and $B$ respectively.

The remainder of the paper is organized as follows. Firstly, Section II presents our proposed hashing method, Orthonormal Encoder (OnE), in detail and provides the analysis showing that our method can retain a high amount of variation and achieve small quantization loss. Section III presents a relaxed version of OnE, i.e., Orthogonal Encoder (OgE). Section IV presents experimental results to validate the effectiveness of our proposed methods. We conclude the paper in Section V.
II. Simultaneous Compression & Quantization: Orthonormal Encoder
II-A. Problem Formulation
In order to jointly learn data dimension reduction and binary quantization using a single linear transformation $W$, we propose to solve the following constrained optimization:
$\min_{W, B} \; \|XW - B\|_F^2 \quad \text{s.t.} \quad W^\top W = I_L, \;\; B \in \{-1, +1\}^{n \times L}$   (1)
where $\|\cdot\|_F$ denotes the Frobenius norm. Additionally, the orthonormal constraint on the columns of $W$ is necessary to ensure that no redundant information is captured in the binary codes [29] and that the projection vectors do not scale the projected data up/down.
It is worth highlighting the differences between our loss function Eq. (1) and the binary quantization loss function of ITQ [2]. Firstly, different from ITQ, which works on the compressed low-dimensional feature space obtained by PCA, our approach works directly on the original high-dimensional feature space $X$. This leads to the second main difference: the non-square column orthonormal matrix $W$ simultaneously (i) compresses the data to low dimension and (ii) quantizes it to binary codes. However, it is important to note that solving for a non-square projection matrix is challenging. To handle this difficulty, ITQ proposes to solve the data compression and binary quantization problems in two separate optimizations. Specifically, it applies PCA to compress the data to $L$ dimensions and then uses the Orthogonal Procrustes approach [30] to learn a square rotation matrix that minimizes the binary quantization loss. However, the ITQ approach has a limitation: the binary constraint is not considered in the data compression step, i.e., PCA. Consequently, the solution is suboptimal. In this paper, by adopting recent advances in the cyclic coordinate descent approach [12, 14, 31, 32], we propose a novel and efficient algorithm that resolves the ITQ limitation by attacking both problems simultaneously in a single optimization under the strict orthogonal constraint. Hence, our optimization can lead to a better solution.
II-B. Optimization
In this section, we discuss the key details of the algorithm (Algorithm 1) for solving the optimization problem Eq. (1). In order to handle the binary constraint in Eq. (1), we propose to use alternating optimization over $W$ and $B$.
II-B1. Fix $W$, update $B$
When $W$ is fixed, the problem becomes exactly the same as fixing the rotation matrix in ITQ [2]. To make the paper self-contained, we repeat the explanation of [2]. By expanding the objective function in Eq. (1), we have
$\|XW - B\|_F^2 = \|XW\|_F^2 - 2\,\mathrm{tr}(B^\top XW) + nL$   (2)
where $\mathrm{tr}(\cdot)$ denotes the trace operator. Because $W$ is fixed, $\|XW\|_F^2$ is fixed; hence, minimizing (2) is equivalent to maximizing
$\mathrm{tr}(B^\top V) = \sum_{i=1}^{n} \sum_{j=1}^{L} B_{ij} V_{ij}$   (3)
where $V = XW$, and $B_{ij}$ and $V_{ij}$ denote the elements of $B$ and $V$ respectively. To maximize this expression with respect to $B$, we need $B_{ij} = 1$ whenever $V_{ij} \geq 0$ and $B_{ij} = -1$ otherwise. Hence, the optimal value of $B$ can simply be achieved by
$B = \mathrm{sign}(XW)$   (4)
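The B-step of Eqs. (3)-(4) is a one-liner; the sketch below (our naming) also makes the $\mathrm{sign}(0) := +1$ convention from the text explicit:

```python
import numpy as np

def update_B(X, W):
    """Fix W, update B (Eq. (4)): each bit takes the sign of the
    corresponding projected coordinate, with B_ij = 1 when V_ij >= 0."""
    V = X @ W
    return np.where(V >= 0, 1.0, -1.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
W = np.linalg.qr(rng.standard_normal((6, 4)))[0]
B = update_B(X, W)
```

Optimality of Eq. (3) can be checked directly: $\mathrm{tr}(B^\top V)$ is at least as large as for any other binary matrix.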
II-B2. Fix $B$, update $W$
When fixing $B$, the optimization is no longer a mixed-integer problem. However, the problem is still non-convex and difficult to solve due to the orthonormal constraint [28]. It is important to note that $W$ is not a square matrix, which means the objective function is not the classic Orthogonal Procrustes problem [30]. Hence, we cannot obtain a closed-form solution for $W$ as proposed in [2]. To the best of our knowledge, there is no easy way to obtain a closed-form solution for a non-square $W$. To overcome this challenge, inspired by PCA and recent cyclic coordinate descent methods [12, 14, 31, 32], we iteratively learn one vector, i.e., one column of $W$, at a time. We now consider the two cases $k = 1$ and $k \geq 2$.
• 1st vector $w_1$:

$\min_{w_1} \|Xw_1 - b_1\|_2^2 \quad \text{s.t.} \quad \|w_1\|_2^2 = 1$   (5)

where $\|\cdot\|_2$ is the $\ell_2$-norm.
Let $\lambda_1$ be the Lagrange multiplier; we formulate the Lagrangian $\mathcal{L}(w_1, \lambda_1)$:
$\mathcal{L}(w_1, \lambda_1) = \|Xw_1 - b_1\|_2^2 + \lambda_1 \left( \|w_1\|_2^2 - 1 \right)$   (6)
By minimizing $\mathcal{L}$ over $w_1$, we obtain:
$w_1 = (X^\top X + \lambda_1 I)^{-1} X^\top b_1$   (7)
given that $\lambda_1$ maximizes the dual function $g(\lambda_1)$ of $\mathcal{L}$ [33] (the dual function can be constructed simply by substituting $w_1$ from Eq. (7) into Eq. (6)). Equivalently, $\lambda_1$ should satisfy the following conditions:
$\lambda_1 + a_D > 0 \quad \text{and} \quad \left\| (X^\top X + \lambda_1 I)^{-1} X^\top b_1 \right\|_2^2 = 1$   (8)
where $a_D$ is the smallest eigenvalue of $X^\top X$.
In Eq. (8), the first condition ensures that $(X^\top X + \lambda_1 I)$ is non-singular, and the second condition is obtained by setting the derivative of the dual function with respect to $\lambda_1$ equal to 0.
The second equation in Eq. (8) can be recognized as a high-order polynomial equation in $\lambda_1$, which has no explicit closed-form solution in general. Fortunately, since $g(\lambda_1)$ is a concave function of $\lambda_1$, its derivative is monotonically decreasing. Hence, we can simply solve for $\lambda_1$ using binary search with a small error tolerance $\epsilon$. Note that:
$\lim_{\lambda_1 \to \infty} \|w_1\|_2 = 0 \quad \text{and} \quad \lim_{\lambda_1 \to -a_D^{+}} \|w_1\|_2 = \infty,$   (9)

thus Eq. (8) always has a solution.
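The binary search of Eq. (8) is cheap once $X^\top X$ is eigendecomposed, since the norm of $w_1(\lambda_1)$ then has a closed form in the eigenbasis. A sketch under our own naming (not the authors' code):

```python
import numpy as np

def solve_lambda1(X, b1, tol=1e-6):
    """Binary-search lambda_1 so that ||(X^T X + lambda_1 I)^-1 X^T b1||_2 = 1
    (Eq. (8)); the norm is monotonically decreasing for lambda_1 > -a_D."""
    a, U = np.linalg.eigh(X.T @ X)          # eigenvalues ascending: a[0] = a_D
    c = U.T @ (X.T @ b1)                    # X^T b1 in the eigenbasis
    norm_w = lambda lam: np.sqrt(np.sum((c / (a + lam)) ** 2))
    lo = -a[0] + 1e-9                       # norm -> infinity near -a_D
    hi = 1.0
    while norm_w(hi) > 1.0:                 # grow hi until the norm drops below 1
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if norm_w(mid) > 1.0 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return U @ (c / (a + lam)), lam         # w1 from Eq. (7), and lambda_1

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8)) / np.sqrt(50)   # toy scaled data
b1 = np.where(rng.standard_normal(50) >= 0, 1.0, -1.0)
w1, lam = solve_lambda1(X, b1)
```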
• $k$-th vector $w_k$ $(k \geq 2)$:
For the second vector onward, besides the unit-norm constraint, we also need to make sure that the current vector is orthogonal to its previous vectors:
$\min_{w_k} \|Xw_k - b_k\|_2^2 \quad \text{s.t.} \quad \|w_k\|_2^2 = 1, \;\; w_k^\top w_j = 0, \; \forall j \in \{1, \dots, k-1\}$   (10)
Let $\lambda_k$ and $\nu = [\nu_1, \dots, \nu_{k-1}]^\top$ be the Lagrange multipliers; we again formulate the Lagrangian $\mathcal{L}(w_k, \lambda_k, \nu)$:
$\mathcal{L}(w_k, \lambda_k, \nu) = \|Xw_k - b_k\|_2^2 + \lambda_k \left( \|w_k\|_2^2 - 1 \right) + \sum_{j=1}^{k-1} \nu_j w_k^\top w_j$   (11)
Minimizing $\mathcal{L}$ over $w_k$, similar to Eq. (7), we obtain:
$w_k = (X^\top X + \lambda_k I)^{-1} \left( X^\top b_k - \tfrac{1}{2} W_{1:k-1} \nu \right), \quad W_{1:k-1} = [w_1, \dots, w_{k-1}],$   (12)
given that $\lambda_k$ and $\nu$ satisfy the following conditions, which maximize the corresponding dual function:
$\lambda_k + a_D > 0 \quad \text{and} \quad \left\| (X^\top X + \lambda_k I)^{-1} z_k \right\|_2^2 = 1,$   (13)

where

$z_k = X^\top b_k - \tfrac{1}{2} W_{1:k-1} \nu, \quad \nu = 2 \left( W_{1:k-1}^\top A_k^{-1} W_{1:k-1} \right)^{-1} W_{1:k-1}^\top A_k^{-1} X^\top b_k,$   (14)

in which $A_k = X^\top X + \lambda_k I$.
There is no straightforward solution for $(\lambda_k, \nu)$ either. In order to resolve this difficulty, we propose to use alternating optimization to solve for $\lambda_k$ and $\nu$. In particular, (i) given a fixed $\nu$ (initialized as $\mathbf{0}$), we find $\lambda_k$ using binary search as discussed above; additionally, similar to $\lambda_1$, there is always a solution for $\lambda_k$. Then, (ii) with $\lambda_k$ fixed, we obtain the closed-form solution for $\nu$ as given in Eq. (14). Note that since the dual function is a concave function of $(\lambda_k, \nu)$, alternately optimizing $\lambda_k$ and $\nu$ still guarantees that the solution approaches the global optimum.
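For $k \geq 2$, the alternation between the binary search for $\lambda_k$ and the closed-form $\nu$ can be sketched as below. All names are ours and Eq. (14) is used as reconstructed above; this is a sketch of the described procedure, not the authors' code:

```python
import numpy as np

def solve_wk(X, bk, W_prev, tol=1e-6, n_alt=10):
    """Alternate (i) binary search for lambda_k with nu fixed and
    (ii) closed-form nu with lambda_k fixed, then return w_k (Eq. (12)).
    W_prev holds the previously learned columns [w_1 .. w_{k-1}]."""
    Xtb = X.T @ bk
    a, U = np.linalg.eigh(X.T @ X)                      # a[0] = smallest eigenvalue
    nu = np.zeros(W_prev.shape[1])                      # nu initialized as 0
    lam = 1.0
    for _ in range(n_alt):
        c = U.T @ (Xtb - 0.5 * W_prev @ nu)             # z_k in the eigenbasis
        norm_w = lambda l: np.sqrt(np.sum((c / (a + l)) ** 2))
        lo, hi = -a[0] + 1e-9, 1.0
        while norm_w(hi) > 1.0:                         # (i) binary search for lambda_k
            hi *= 2.0
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if norm_w(mid) > 1.0 else (lo, mid)
        lam = 0.5 * (lo + hi)
        Ainv_P = U @ ((U.T @ W_prev) / (a + lam)[:, None])   # A^-1 [w_1..w_{k-1}]
        Ainv_b = U @ ((U.T @ Xtb) / (a + lam))               # A^-1 X^T b_k
        nu = 2.0 * np.linalg.solve(W_prev.T @ Ainv_P,        # (ii) closed-form nu
                                   W_prev.T @ Ainv_b)
    return U @ ((U.T @ (Xtb - 0.5 * W_prev @ nu)) / (a + lam))

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 6)) / np.sqrt(60)
bk = np.where(rng.standard_normal(60) >= 0, 1.0, -1.0)
w1 = rng.standard_normal((6, 1)); w1 /= np.linalg.norm(w1)   # stand-in first column
w2 = solve_wk(X, bk, w1)
```

By construction, the returned column is exactly orthogonal to the previous ones, while its norm approaches 1 as the alternation converges.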
II-C. Retained Variation and Quantization Loss
In the hashing problem for image retrieval, both the retained variation and the quantization loss are important. In this section, we provide analysis to show that, when solving Eq. (1), it is possible to retain a high amount of the variation and achieve small quantization loss. As will be discussed in more detail, this can be accomplished by applying an appropriate scale $S$ to the input dataset. Noticeably, applying any positive scale $S$ (for simplicity, we only discuss positive values of $S$; negative values should have similar effects) to the dataset strictly preserves the local structure of the data, i.e., the ranked nearest-neighbor set of every data point is always the same. Therefore, in the hashing problem for the retrieval task, it is equivalent to work on a scaled version of the dataset, i.e., $SX$. We can rewrite the loss function of Eq. (1) as follows:
$Q(W, B) = \|SXW - B\|_F^2 = \left\| \mathrm{abs}(SXW) - \mathbf{1} \right\|_F^2$   (15)
where $\mathrm{abs}(\cdot)$ is the element-wise absolute-value operation and $\mathbf{1}$ is the all-1 matrix. In what follows, we discuss how $S$ can affect the retained variation and the quantization loss.
II-C1. Maximizing Retained Variation
We recognize that by scaling the dataset by an appropriate scale $S$, such that all projected data points lie inside the hypercube $[-1, 1]^L$, i.e., $\mathrm{abs}(SXW) \leq \mathbf{1}$, the problem of maximizing the retained variation (PCA) can achieve results similar to those of minimizing the quantization loss. Intuitively, we can interpret the former problem, i.e., PCA, as finding the projection that maximizes the distances of the projected data points from the coordinate origin, while the latter problem, i.e., minimizing the binary quantization loss, tries to find the projection matrix that minimizes the distances of the projected data points from $-1$ or $+1$ correspondingly. A simple 1D illustration of the relationship between the two problems is given in Figure 2.
Since each column of $W$ is constrained to have unit norm, the condition $\mathrm{abs}(SXW) \leq \mathbf{1}$ can be satisfied by scaling the dataset by $S = 1/r$ so that all data points in the original space lie inside the hyperball of unit radius, where $r$ is the largest distance between a data point and the coordinate origin.
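The scaling rule $S = 1/r$ can be checked in a few lines; by the Cauchy-Schwarz inequality, any unit-norm projection of points inside the unit hyperball lands in $[-1, 1]$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 32))
X -= X.mean(axis=0)                                  # zero-center
S = 1.0 / np.max(np.linalg.norm(X, axis=1))          # S = 1/r
w = rng.standard_normal(32)
w /= np.linalg.norm(w)                               # any unit-norm direction
p = (S * X) @ w                                      # projections lie in [-1, 1]
```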
II-C2. Minimizing Quantization Loss
Regarding the quantization loss (Eq. (15)), which is a convex function of $SXW$, by setting its derivative with respect to $SXW$ to zero, we have the optimal solution:

$\mathrm{abs}(SXW) - \mathbf{1} = \mathbf{0},$   (16)

where $\mathbf{0}$ is the all-0 matrix.
Considering Eq. (16), there are two important findings. Firstly, there is obviously no scaling value $S$ that can concurrently achieve the maximum retained variation and $\mathrm{abs}(SXW) = \mathbf{1}$, except in the case where all projected data points have the same magnitude, which is unreal in practice. Secondly, from Eq. (16), we can recognize that as $S$ gets larger, minimizing the loss will produce a $W$ that focuses on lower-variance directions, bringing $\mathrm{abs}(SXW)$ closer to the global minimizer $\mathbf{1}$ of $Q$. Consequently, the quantization loss becomes smaller. In Figure 3, we show a toy example to illustrate that as $S$ increases, minimizing the quantization loss diverts the projection vector from the top PCA component (Figure 2(a)) to smaller-variance directions (Figures 2(b), 2(c)), while the quantization loss (per bit) gets smaller (Figures 2(a)-2(c)). In summary, as the retained variation gets smaller, the quantization loss gets smaller and vice versa. However, note that continuing to increase $S$ when $W$ already focuses on the least-variance directions will make the quantization loss larger.
Note that the scale $S$ is a hyperparameter in our system. In the experiment section (Section IV-B), we additionally conduct experiments to quantitatively analyze the effect of this hyperparameter and determine proper values using a validation dataset.
III. Simultaneous Compression & Quantization: Orthogonal Encoder
III-A. Problem Reformulation: Orthonormal to Orthogonal
In the Orthonormal Encoder (OnE), we work with the column orthonormal constraint on $W$. However, we recognize that relaxing this constraint to a column orthogonal constraint, i.e., relaxing the unit-norm constraint on each column of $W$ by converting it into a penalty term, provides three important advantages. The new loss function is as follows:
$\min_{W, B} \; \|XW - B\|_F^2 + \alpha \|W\|_F^2 \quad \text{s.t.} \quad w_i^\top w_j = 0 \;\; (\forall i \neq j), \;\; B \in \{-1, +1\}^{n \times L}$   (17)
where $\alpha$ is a fixed positive hyperparameter penalizing large norms of the columns of $W$. It is important to note that, in Eq. (17), we still enforce the strict pairwise-independence constraint on the projection vectors to ensure that no redundant information is captured.
Firstly, with an appropriately large $\alpha$, the optimization prefers to choose large-variance components of $X$, since this helps to achieve projection vectors with smaller norms. In other words, without penalizing large norms of $W$, the optimization has no incentive to focus on high-variance components, since it can produce projection vectors with arbitrarily large norms that scale any components appropriately to achieve the minimum binary quantization loss. Secondly, this provides more flexibility by allowing different scale values for different directions. Consequently, relaxing the unit-norm constraint on each column of $W$ helps to mitigate the difficulty of choosing the scale value $S$. However, it is important to note that a too large $\alpha$, on the other hand, may distract the optimization from minimizing the binary quantization term. Finally, from the OnE optimization (Section II-B), we observed that the unit-norm constraint on each column of $W$ prevents the OnE optimization from being solved efficiently, since there is no closed-form solution for $\lambda_k$. By relaxing this unit-norm constraint, we can now achieve closed-form solutions for the columns of $W$; hence, it is very computationally beneficial. We discuss the computational aspect further in Section III-C.
III-B. Optimization
Similar to Algorithm 1 for solving the Orthonormal Encoder, we alternately optimize $W$ and $B$, where the $B$-step is exactly the same as Eq. (4). For the $W$-step, we also utilize the cyclic coordinate descent approach to iteratively solve for $W$, i.e., column by column. The loss functions can be rewritten, and their corresponding closed-form solutions for $w_k$ can be efficiently obtained, as follows:
• 1st vector $w_1$:

$\min_{w_1} \|Xw_1 - b_1\|_2^2 + \alpha \|w_1\|_2^2$   (18)
We can see that Eq. (18) is a regularized least squares (ridge regression) problem, whose closed-form solution is given as:
$w_1 = (X^\top X + \alpha I)^{-1} X^\top b_1$   (19)
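Eq. (19) is plain ridge regression; a sketch (our naming):

```python
import numpy as np

def update_w1_oge(X, b1, alpha):
    """Closed-form solution of Eq. (19): w1 = (X^T X + alpha I)^-1 X^T b1."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ b1)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
b1 = np.where(rng.standard_normal(100) >= 0, 1.0, -1.0)
w1 = update_w1_oge(X, b1, alpha=1.0)
```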
• $k$-th vector $w_k$ $(k \geq 2)$:

$\min_{w_k} \|Xw_k - b_k\|_2^2 + \alpha \|w_k\|_2^2 \quad \text{s.t.} \quad w_k^\top w_j = 0, \; \forall j \in \{1, \dots, k-1\}$   (20)
Given the Lagrange multipliers $\nu$, similar to Eq. (7) and Eq. (11), we can obtain $w_k$ as follows:
$w_k = (X^\top X + \alpha I)^{-1} \left( X^\top b_k - \tfrac{1}{2} W_{1:k-1} \nu \right)$   (21)

where $\nu = 2\, T_k^{-1} W_{1:k-1}^\top (X^\top X + \alpha I)^{-1} X^\top b_k$, in which

$T_k = W_{1:k-1}^\top (X^\top X + \alpha I)^{-1} W_{1:k-1}$   (22)

and $W_{1:k-1} = [w_1, \dots, w_{k-1}]$.
Note that, given a fixed $\alpha$, $(X^\top X + \alpha I)^{-1}$ is a constant matrix, and the matrix $T_k$ contains $T_{k-1}$ in its top-left corner. This means that only the last row and column of $T_k$ need to be computed at step $k$. Thus, $w_k$ can be solved even more efficiently.
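Putting the pieces together, the whole OgE training loop — the B-step of Eq. (4) alternated with the column-wise closed forms of Eqs. (19) and (21) — fits in a few lines. This sketch (all names ours, not the authors' code) recomputes $T_k$ from scratch for clarity; the incremental top-left-corner update is an optimization on top of it:

```python
import numpy as np

def oge_train(X, L, alpha=1.0, n_iter=10, seed=0):
    """Sketch of Orthogonal Encoder training: alternate B = sign(XW)
    with cyclic coordinate descent over the columns of W."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    Ainv = np.linalg.inv(X.T @ X + alpha * np.eye(D))   # constant once alpha is fixed
    W = np.linalg.qr(rng.standard_normal((D, L)))[0]
    for _ in range(n_iter):
        B = np.where(X @ W >= 0, 1.0, -1.0)             # B-step, Eq. (4)
        for k in range(L):                              # W-step, column by column
            g = Ainv @ (X.T @ B[:, k])                  # (X^T X + alpha I)^-1 X^T b_k
            if k == 0:
                W[:, 0] = g                             # Eq. (19)
            else:
                P = W[:, :k]                            # previous columns
                T = P.T @ (Ainv @ P)                    # Eq. (22), recomputed for clarity
                nu = 2.0 * np.linalg.solve(T, P.T @ g)
                W[:, k] = g - 0.5 * Ainv @ (P @ nu)     # Eq. (21)
    return W, np.where(X @ W >= 0, 1.0, -1.0)

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 12))
X -= X.mean(axis=0)
W, B = oge_train(X, L=4)
```

After each full column sweep, the learned columns are pairwise orthogonal by construction, i.e., $W^\top W$ is diagonal.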
III-C. Complexity Analysis
The complexity of the two algorithms, OnE and OgE, is shown in Table I. In our empirical experiments, the number of alternating iterations is usually around 50, the binary search takes at most 10 iterations, and $L \ll D$ (for CNN fully-connected features (Section IV-A)). Firstly, we can observe that OgE is very efficient, as its complexity depends only linearly on the number of training samples $n$, the feature dimension $D$, and the code length $L$. In addition, OgE is also faster than OnE. Furthermore, as our methods aim to learn projection matrices that preserve high-variance components, it is unnecessary to work on very high-dimensional features, since the many low-variance/noisy components will be discarded eventually. In practice, we observe no performance drop when applying PCA to compress features to much lower dimensions, e.g., 512-D. This helps to achieve significant speed-ups in training time for both algorithms, especially for OnE, whose time complexity grows quickly with $D$. In addition, we conduct experiments to measure the actual running time of the algorithms and compare them with other methods in Section IV-D.
Computational complexity  

OnE  
OgE 
IV. Experiments
IV-A. Datasets, Evaluation Protocols, and Implementation Notes
Dataset  CIFAR10 [34]  LabelMe1250k [35]  SUN397 [36]  

L  8  16  24  32  8  16  24  32  8  16  24  32  
mAP  SpH [16]  17.09  18.77  20.19  20.96  11.68  13.24  14.39  14.97  9.13  13.53  16.63  19.07 
KMH [15]  22.22  24.17  24.71  24.99  16.09  16.18  16.99  17.24  21.91  26.42  28.99  31.87  
BA [13]  23.24  24.02  24.77  25.92  17.48  17.10  17.91  18.07  20.73  31.18  35.36  36.40  
ITQ [2]  24.75  26.47  26.86  27.19  17.56  17.73  18.52  19.09  20.16  30.95  35.92  37.84  
SCQ  OnE  27.08  29.64  30.57  30.82  19.76  21.96  23.61  24.25  23.37  34.09  38.13  40.54  
SCQ  OgE  26.98  29.33  30.65  31.15  20.63  23.07  23.54  24.68  23.44  34.73  39.47  41.82  
prec@r2  SpH  18.04  30.58  37.28  21.40  11.72  19.38  25.14  13.66  6.88  23.68  37.21  27.39 
KMH  21.97  36.64  42.33  27.46  15.20  26.17  32.09  18.62  9.50  36.14  51.27  39.29  
BA  23.67  38.05  42.95  23.49  16.22  25.75  31.35  13.14  10.50  37.75  50.38  41.11  
ITQ  24.38  38.41  42.96  28.63  15.86  25.46  31.43  17.66  9.78  35.15  49.85  46.34  
SCQ  OnE  24.48  36.49  41.53  43.90  16.69  27.30  34.63  33.04  8.68  30.12  43.54  50.41  
SCQ  OgE  24.35  38.30  43.01  44.01  16.57  27.80  34.77  34.64  8.76  29.31  45.03  51.88  
prec@1k  SpH  22.93  26.99  29.50  31.98  14.07  16.78  18.52  19.27  10.79  15.36  18.21  20.07 
KMH  32.30  33.65  35.52  37.77  21.07  20.97  21.41  21.98  18.94  24.93  25.74  28.26  
BA  31.73  34.16  35.67  37.01  21.14  21.71  22.64  22.83  19.22  28.68  31.31  31.80  
ITQ  32.40  36.35  37.25  37.96  21.01  22.00  22.98  23.63  18.86  28.62  31.56  32.74  
SCQ  OnE  33.38  37.82  39.13  40.40  22.91  25.39  26.55  27.16  19.26  29.95  32.72  34.08  
SCQ  OgE  33.41  38.33  39.54  40.70  23.94  25.94  26.99  27.46  20.10  29.95  33.43  35.00 
The Bold and Underline values indicate the best and second best performances respectively.
The CIFAR-10 dataset [34] contains 60,000 fully-annotated 32x32 color images from 10 object classes (6,000 images per class). The provided test set (1,000 images per class) is used as the query set. The remaining 50,000 images are used as the training set and database.
The LabelMe-12-50k dataset [35] consists of 50,000 fully-annotated 256x256 color images of 12 object classes, which is a subset of the LabelMe dataset [37]. In this dataset, for an image having multiple label values in $[0, 1]$, the object class with the largest label value is chosen as the image label. We also use the provided test set as the query set and the remaining images as the training set and database.
The SUN397 dataset [36] contains approximately 108,000 fully-annotated color images from 397 scene categories. We select the subset of categories that contain more than 500 images each to construct our dataset. We then randomly sample 100 images per class to form the query set. The remaining images are used as the training set and database.
For the above image datasets, each image is represented by a 4096-D feature vector extracted from the fully-connected layer 7 of the pre-trained VGG network [38].
Evaluation protocols. As the datasets are fully annotated, we use the semantic labels to define the ground truths of the image queries. We apply three standard evaluation metrics, which are widely used in the literature [13, 39, 2], to measure the retrieval performance of all methods: 1) mean Average Precision (mAP); 2) precision at Hamming radius 2 (prec@r2), which measures the precision of retrieved images having Hamming distance at most 2 to the query (we report zero precision for queries that return no image); and 3) precision at top 1000 (prec@1k), which measures the precision over the top 1000 retrieved images.

Implementation notes. As discussed in Section III-C, for computational efficiency, we apply PCA to reduce the feature dimension to 512-D for our proposed methods. The hyperparameter $\alpha$ of the OgE algorithm is set empirically and kept fixed for all experiments. Finally, for both OnE and OgE, we set all error-tolerance values $\epsilon$ to a small constant and cap the maximum number of iterations. The implementation of our methods is available at https://github.com/hnanhtuan/SCQ.git.
For all compared methods, i.e., Spherical Hashing (SpH) [16], K-means Hashing (KMH) [15], Binary Autoencoder (BA) [13], and Iterative Quantization (ITQ) [2], we use the implementations with the suggested parameters provided by the authors. (Due to the very long training time of KMH [15] at high dimensionality, we apply PCA to reduce the dimension from 4096-D to 512-D; additionally, we execute KMH with multiple settings and report the best results.) Besides, to improve the statistical stability of the results, we report the average values over 5 executions.

IV-B. Effects of Parameters
As discussed in Section II-C, when $S$ decreases, the projection matrix $W$ can be learned to retain a very high amount of variation, as much as PCA can. However, this causes an undesirably large binary quantization loss, and vice versa. In this section, we additionally provide a quantitative analysis of the effects of the scale parameter $S$ on these two factors and, moreover, on the retrieval performance.
In this experiment, for all datasets, i.e., CIFAR-10, LabelMe-12-50k, and SUN397, we randomly select 20 images per class from the training set (as discussed in Section IV-A) to form the validation set. The remaining images are used for training. To obtain each data point, we solve the problem of Eq. (1) at various scale values $S$ and use the OnE algorithm (Algorithm 1, Section II-B) to tackle the optimization.
Figure 4 presents (i) the quantization loss per bit, (ii) the percentage of the total variation retained by the quantization-loss-minimizing projection matrix in comparison with the total variation retained by the top $L$ PCA components as $S$ varies, and (iii) the retrieval performance (mAP) on the validation sets. Firstly, we can observe that there is no scale $S$ that simultaneously maximizes the retained variation and minimizes the quantization loss. On the one hand, as the scale value decreases, minimizing the loss function of Eq. (15) produces a projection matrix that focuses on high-variance directions, i.e., retains more variation in comparison with PCA (red line). On the other hand, at smaller $S$, the quantization loss is much larger (blue dashed line). The empirical results are consistent with our discussion in Section II-C.
Secondly, regarding the retrieval performance, unsurprisingly, the performance drops when the scale gets too small, i.e., a high amount of variation is retained but the quantization loss is too large, or too large, i.e., the quantization loss is small but only low-variance components are retained. Hence, it is necessary to balance these two factors. As the data variation varies from dataset to dataset, the scale value should be determined from the dataset. In particular, we leverage the eigenvalues $\{a_i\}$, which are the variances of the PCA components, to determine this hyperparameter. From the experimental results in Figure 4, we propose to set the scale parameter as:
$S = \sqrt{\dfrac{L}{\sum_{i=1}^{L} a_i}}$   (23)
This setting can generally achieve the best performance across multiple datasets, feature types, and hash lengths, without resorting to multiple trainings and cross-validations. The proposed working points of the scale are shown in Figure 4. We apply this scale parameter to all datasets, for both the OnE and OgE algorithms, in all later experiments.
Note that the numerator of the fraction in Eq. (23), i.e., $L$, is the hash code length, which is also the total variation of the binary codes $B$. In addition, the denominator $\sum_{i=1}^{L} a_i$ is the total variation of the top $L$ PCA components, i.e., the maximum amount of variation that can be retained in an $L$-dimensional feature space. Hence, we can interpret the scale $S$ as a factor that makes the amounts of variation, i.e., energy, of the input $SX$ and the output (the binary codes $B$) comparable. This property is important: when the variation of the input is much larger than that of the output, there is obviously some information loss; on the other hand, when the variation of the output is larger than that of the input, the output contains undesirable additional information.
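Under our reading of Eq. (23) — taking the eigenvalues to be per-sample variances, i.e., eigenvalues of the covariance matrix normalized by $n$, which is our assumption about the normalization convention — the scale can be computed as:

```python
import numpy as np

def scale_from_eigenvalues(X, L):
    """S of Eq. (23): match the total variation of the scaled input in the
    top-L PCA subspace to that of the binary codes, which is L."""
    Xc = X - X.mean(axis=0)
    a = np.linalg.eigvalsh(Xc.T @ Xc / Xc.shape[0])[::-1]   # variances, descending
    return np.sqrt(L / np.sum(a[:L]))

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 16)) * 3.0
S = scale_from_eigenvalues(X, L=8)
```

By construction, after scaling, the total variance in the top-$L$ subspace equals $L$, matching the energy of the binary codes.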
IV-C. Comparison with State-of-the-art
In this section, we evaluate our proposed hashing methods, SCQ-OnE and SCQ-OgE, and compare them with the state-of-the-art unsupervised hashing methods SpH, KMH, BA, and ITQ. The experimental results in mAP, prec@r2, and prec@1k are reported in Table II. Our proposed methods clearly achieve significant improvements over all datasets on the majority of evaluation metrics, and the improvement gaps are clearer at higher code lengths, i.e., L = 32. Additionally, OgE generally achieves slightly higher performance than OnE. Moreover, it is noticeable that, for prec@r2, all compared methods suffer a performance downgrade at long hash codes, e.g., L = 32, while our proposed methods still achieve good prec@r2. This shows that the binary codes produced by our methods highly preserve data similarity.
Methods  mAP  prec@r2  

16  32  16  32  
CIFAR10  DH [39]  16.17  16.62  23.33  15.77 
UHBDNN [14]  17.83  18.52  24.97  18.85  
SCQ  OnE  17.97  18.63  24.57  23.72  
SCQ  OgE  18.00  18.78  24.15  25.69 
Comparison with Deep Hashing (DH) [39] and Unsupervised Hashing with Binary Deep Neural Network (UH-BDNN) [14]. Recently, several methods [39, 14] have applied DNNs to learn binary hash codes and achieved very competitive performance. Hence, in order to have a complete evaluation, following the experimental settings of [39, 14], we conduct experiments on the CIFAR-10 dataset. In this experiment, 100 images are randomly sampled for each class as the query set; the remaining images serve as the training set and database. Each image is represented by a 512-D GIST descriptor [40]. In addition, to avoid biased results due to test-sample selection, we repeat the experiment 5 times with 5 different random training/query splits. The comparative results in terms of mAP and prec@r2 are presented in Table III. Our proposed methods are very competitive with DH and UH-BDNN, specifically achieving higher mAP and prec@r2 than DH and UH-BDNN at L = 32.
Methods  CIFAR10  NUSWIDE  

12  24  32  48  12  24  32  48  
mAP  BGAN [27]  40.1  51.2  53.1  55.8  67.5  69.0  71.4  72.8 
SCQ  OnE  53.59  55.77  57.62  58.14  69.82  70.53  72.78  73.25  
SCQ  OgE  53.83  55.65  57.74  58.44  70.17  71.31  72.49  72.95 
Comparison with Binary Generative Adversarial Networks for Image Retrieval (BGAN) [27]. Recently, BGAN applied a continuous approximation of the sign function to learn binary codes that help to generate images plausibly similar to the original images. The method has been shown to achieve outstanding performance in the unsupervised image hashing task. We note that BGAN differs from our method and the compared methods in that BGAN jointly learns the image feature representations and the binary codes, where the binary codes are obtained using a smooth approximation of the sign function, while ours and the compared methods learn optimal binary codes given the image representations. Hence, to further validate the effectiveness of our methods and to compare with BGAN, we apply our methods to the FC7 features extracted from the feature-extraction component of the pre-trained BGAN model on the CIFAR-10 and NUS-WIDE [41] datasets. In this experiment, we aim to show that, by applying our hashing methods to the pre-trained features from the feature-extraction component of BGAN, our methods can produce better hash codes than the joint learning approach of BGAN. Similar to BGAN [27], for both CIFAR-10 and NUS-WIDE, we randomly select 100 images per class as the test query set; the remaining images are used as the database for retrieval. We then randomly sample from the database 1,000 images per class as the training set. Table IV shows that, by using the more discriminative features from the pre-trained feature-extraction component of BGAN, our methods can outperform BGAN, i.e., our methods produce better binary codes than the approximate sign function in BGAN, and achieve state-of-the-art performance in the unsupervised image hashing task.

IV-D. Training Time and Processing Time
In this experiment, we empirically evaluate the training time and online processing time of our methods. The experiments are carried out on a workstation with a 4-core i7-6700 CPU @ 3.40GHz, on the combination of the CIFAR-10, LabelMe-12-50k, and SUN397 datasets. For OnE and OgE, the training time includes the time for applying zero-mean normalization, scaling, and dimensionality reduction. We use 50 iterations for all experiments. Fig. 5 shows that our proposed methods, OnE and OgE, are very efficient. OgE is only slightly slower than ITQ [2]. Even though OnE is slower than OgE and ITQ, it takes just over a minute for 100,000 training samples, which is still very fast and practical in comparison with several dozen minutes for KMH [15], BA [13], and UH-BDNN [14] (for training 50,000 CIFAR-10 samples using the authors' released code and dataset [14]).
Compared with the training cost, the time to produce new hash codes is more important since it is done in real time. Similar to Semi-Supervised Hashing (SSH) [29] and ITQ [2], by using only a single linear transformation, our proposed methods require only one BLAS operation (gemv or gemm) and a comparison operation; hence, they take negligible time to produce binary codes for new data points.
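To make the online encoding cost concrete, below is a minimal NumPy sketch of such an encoder. The function name `encode` and the parameters `mean`, `scale`, and `W` are illustrative placeholders, not the paper's notation; the sketch only demonstrates the single gemm/gemv-plus-comparison pattern shared by SSH, ITQ, and our methods:

```python
import numpy as np

def encode(X, mean, scale, W):
    """Produce binary codes for new data points with a single linear
    transformation followed by a comparison.

    X:     (n, d) raw features for n new data points
    mean:  (d,) training-set mean (for zero-mean normalization)
    scale: scalar scaling factor learned on the training set
    W:     (d, L) learned projection matrix, L = code length
    """
    Z = (X - mean) * scale @ W        # one BLAS gemm (gemv when n == 1)
    return (Z >= 0).astype(np.uint8)  # comparison -> {0, 1} codes
```

For storage, the resulting codes can be bit-packed, e.g., with `np.packbits(B, axis=1)`, so that each L-bit code occupies L/8 bytes in the hash table.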
V Conclusion
In this paper, we successfully addressed the problem of jointly learning to preserve data pairwise (dis)similarity in a low-dimensional space and to minimize the binary quantization loss under a strict diagonal constraint. Additionally, we showed that retaining more variation causes an undesirably large quantization loss, and vice versa; hence, by appropriately balancing these two factors with a scale, our methods can produce better binary codes. Extensive experiments show that our proposed methods, Simultaneous Compression and Quantization (SCQ): Orthonormal Encoder (OnE) and Orthogonal Encoder (OgE), outperform other state-of-the-art hashing methods by clear margins on various benchmark datasets under standard evaluation metrics. Furthermore, OnE and OgE are very computationally efficient in both the training and testing steps.
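The variance-quantization trade-off summarized above can be sketched in generic ITQ-style notation; the formulation below is illustrative only, not the paper's exact objective, with data matrix $\mathbf{X}$, projection $\mathbf{W}$, binary codes $\mathbf{B}$, and a scale $\beta$ balancing the two factors:

```latex
\min_{\mathbf{B},\,\mathbf{W},\,\beta}\;
  \big\| \mathbf{B} - \beta\, \mathbf{X}\mathbf{W} \big\|_F^2
\quad \text{s.t.} \quad
  \mathbf{W}^{\top} \mathbf{W} = \mathbf{I}, \;\;
  \mathbf{B} \in \{-1, +1\}^{n \times L}
```

Intuitively, the orthogonality constraint preserves pairwise distances after projection, while the scale $\beta$ stretches the projected data toward the binary targets, trading retained variance against quantization loss.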
References
 [1] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” Commun. ACM, vol. 51, Jan. 2008.
 [2] Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in CVPR, 2011.
 [3] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in NIPS, 2009.
 [4] D. Zhang, J. Wang, D. Cai, and J. Lu, “Self-taught hashing for fast similarity search,” in ACM SIGIR, 2010.
 [5] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in VLDB, 1999.
 [6] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in ICCV, Nov 2009.
 [7] M. Raginsky and S. Lazebnik, “Locality-sensitive binary codes from shift-invariant kernels,” in NIPS, 2009.
 [8] B. Kulis and T. Darrell, “Learning to hash with binary reconstructive embeddings,” in NIPS, 2009.

 [9] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter, “Fast supervised hashing with decision trees for high-dimensional data,” in CVPR, 2014.
 [10] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in CVPR, 2012.
 [11] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, “Hamming distance metric learning,” in NIPS, 2012.
 [12] F. Shen, C. Shen, W. Liu, and H. T. Shen, “Supervised discrete hashing,” in CVPR, 2015.
 [13] M. Á. CarreiraPerpiñán and R. Raziperchikolaei, “Hashing with binary autoencoders,” in CVPR, 2015.
 [14] T.-T. Do, A.-D. Doan, and N.-M. Cheung, “Learning to hash with binary deep neural network,” in ECCV, 2016.
 [15] K. He, F. Wen, and J. Sun, “K-means hashing: An affinity-preserving quantization method for learning binary compact codes,” in CVPR, 2013.
 [16] J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon, “Spherical hashing,” in CVPR, 2012.
 [17] F. Shen, Y. Xu, L. Liu, Y. Yang, Z. Huang, and H. T. Shen, “Unsupervised deep hashing with similarityadaptive and discrete optimization,” IEEE TPAMI, pp. 1–1, 2018.
 [18] M. Hu, Y. Yang, F. Shen, N. Xie, and H. T. Shen, “Hashing with angular reconstructive embeddings,” IEEE TIP, vol. 27, no. 2, pp. 545–555, Feb 2018.
 [19] Y. Huang and Z. Lin, “Binary multidimensional scaling for hashing,” IEEE TIP, vol. 27, no. 1, pp. 406–418, Jan 2018.
 [20] L.-Y. Duan, Y. Wu, Y. Huang, Z. Wang, J. Yuan, and W. Gao, “Minimizing reconstruction bias hashing via joint projection learning and quantization,” IEEE TIP, vol. 27, no. 6, pp. 3127–3141, June 2018.
 [21] M. Wang, W. Zhou, Q. Tian, and H. Li, “A general framework for linear distance preserving hashing,” IEEE TIP, vol. 27, no. 2, pp. 907–922, Feb 2018.
 [22] K. Grauman and R. Fergus, “Learning binary hash codes for large-scale image search,” in Studies in Computational Intelligence, vol. 411, Jan 2013.
 [23] J. Wang, W. Liu, S. Kumar, and S.-F. Chang, “Learning to hash for indexing big data - a survey,” Proceedings of the IEEE, 2015.
 [24] J. Wang, H. T. Shen, J. Song, and J. Ji, “Hashing for similarity search: A survey,” Aug 2014.
 [25] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, “A survey on learning to hash,” TPAMI, 2017.
 [26] K. Lin, J. Lu, C.-S. Chen, and J. Zhou, “Learning compact binary descriptors with unsupervised deep neural networks,” in CVPR, 2016.
 [27] J. Song, “Binary generative adversarial networks for image retrieval,” in AAAI, 2018.
 [28] Z. Wen and W. Yin, “A feasible method for optimization with orthogonality constraints,” Math. Program., Dec 2013.
 [29] J. Wang, S. Kumar, and S.-F. Chang, “Semi-supervised hashing for large-scale search,” TPAMI, 2012.
 [30] P. H. Schönemann, “A generalized solution of the orthogonal procrustes problem,” Psychometrika, 1966.
 [31] M. Gurbuzbalaban, A. Ozdaglar, P. A. Parrilo, and N. Vanli, “When cyclic coordinate descent outperforms randomized coordinate descent,” in NIPS, 2017.
 [32] G. Yuan and B. Ghanem, “An exact penalty method for binary optimization based on mpec formulation,” in AAAI, 2017.
 [33] S. Boyd and L. Vandenberghe, Convex Optimization. New York, NY, USA: Cambridge University Press, 2004.
 [34] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” in Technical report, University of Toronto, 2009.
 [35] R. Uetz and S. Behnke, “Large-scale object recognition with CUDA-accelerated hierarchical neural networks,” in IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS), 2009.
 [36] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, “Sun database: Exploring a large collection of scene categories,” IJCV, Aug 2016.
 [37] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “Labelme: A database and webbased tool for image annotation,” IJCV, pp. 157–173, 2008.
 [38] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, 2014.
 [39] V. Erin Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou, “Deep hashing for compact binary codes learning,” in CVPR, 2015.
 [40] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” IJCV, pp. 145–175, 2001.
 [41] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng, “NUS-WIDE: A real-world web image database from National University of Singapore,” in Proc. of ACM Conf. on Image and Video Retrieval, 2009.