Simultaneous Compression and Quantization: A Joint Approach for Efficient Unsupervised Hashing

02/19/2018 · by Tuan Hoang, et al.

The two most important requirements for unsupervised data-dependent hashing methods are to preserve similarity in the low-dimensional feature space and to minimize the binary quantization loss. Although many hashing methods have been proposed in the literature, there is still room for improvement in addressing both requirements simultaneously and adequately. In this paper, we propose a novel approach, named Simultaneous Compression and Quantization (SCQ), to jointly learn to compress and binarize input data in a single formulation. A simple scale pre-processing step is introduced to help preserve data similarity. With this approach, we introduce a loss function and its relaxed version, termed Orthonormal Encoder (OnE) and Orthogonal Encoder (OgE) respectively, which involve the challenging binary and orthogonal constraints. We then propose novel algorithms that can effectively handle these challenging constraints. Comprehensive experiments on unsupervised image retrieval show that our proposed methods consistently outperform other state-of-the-art hashing methods while still being very computationally efficient.

I Introduction

For decades, image hashing has been an active research field in the vision community [1, 2, 3, 4] due to its advantages in storage and computation speed for similarity search/retrieval under specific conditions [2]. Firstly, the binary code should be short so that the whole hash table can fit in memory. Secondly, the binary code should preserve similarity, i.e., (dis)similar images should have (dis)similar hash codes in Hamming space. Finally, the algorithm for learning the parameters should be fast, and the hashing method should produce hash codes for unseen samples efficiently. It is very challenging to simultaneously satisfy all three requirements, especially under the binary constraint, which leads to an NP-hard mixed-integer optimization problem. In this paper, we aim to tackle all these challenging conditions and constraints.

The hashing methods proposed in the literature can be categorized as data-independent [5, 6, 7] or data-dependent; the latter has recently received more attention in both (semi-)supervised [8, 9, 10, 11, 12] and unsupervised [13, 14, 2, 15, 16, 17, 18, 19, 20, 21] settings. However, in practice, labeled datasets are limited and costly; hence, in this work, we focus only on the unsupervised setting. We refer readers to recent surveys [22, 23, 24, 25] for more detailed reviews of data-independent/dependent hashing methods.

I-A Related works

The most relevant work to our proposal is Iterative Quantization (ITQ) [2], which is a very fast and competitive hashing method. ITQ rests on two ideas. Firstly, to achieve low-dimensional features, it uses the well-known Principal Component Analysis (PCA) method. PCA maximizes the variance of the projected data and keeps the dimensions pairwise uncorrelated; hence, the low-dimensional data, projected using the top PCA component vectors, preserves data similarity well. Secondly, minimizing the binary quantization loss using an orthogonal rotation matrix strictly maintains the pairwise distances of the data. As a result, ITQ learns binary codes that can highly preserve the local structure of the data. However, optimizing these two steps separately, especially when no binary constraint is enforced in the first step, i.e., PCA, leads to suboptimal solutions. In contrast, we propose to jointly optimize the projection variation and the quantization loss.
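For concreteness, the following numpy sketch outlines ITQ's two-stage pipeline as described above (PCA projection, then alternating sign and Procrustes-rotation updates). It assumes zero-centered data and uses our own variable names, so it is an illustration of the idea rather than the authors' reference implementation.

```python
import numpy as np

def itq_sketch(X, L, n_iter=50):
    """Minimal two-stage ITQ sketch: PCA projection, then alternating
    sign / Procrustes-rotation updates (Gong & Lazebnik, CVPR'11)."""
    # Stage 1: PCA -- keep the top-L eigenvectors of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:L]]   # D x L projection
    Z = X @ W                                       # low-dimensional data

    # Stage 2: learn an orthogonal rotation R minimizing ||B - Z R||_F^2.
    R = np.linalg.qr(np.random.randn(L, L))[0]
    for _ in range(n_iter):
        B = np.sign(Z @ R)                          # fix R, update B
        U, _, Vt = np.linalg.svd(B.T @ Z)           # fix B, update R (Procrustes)
        R = (U @ Vt).T
    return W @ R, np.sign(Z @ R)                    # final projection and codes
```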

Other works that are highly relevant to our proposed method are Binary Autoencoder (BA) [13] and UH-BDNN [14]. In these methods, the authors propose to combine data dimensionality reduction and binary quantization into a single step by using the encoder of an autoencoder, while the decoder encourages (dis)similar inputs to map to (dis)similar binary codes. However, the reconstruction criterion is not a direct way to preserve similarity [14]. Additionally, although it achieves very competitive performance, UH-BDNN is based on a deep neural network (DNN); hence, it is computationally expensive to produce the binary codes.

Recently, many works [26, 27] have leveraged the powerful capability of Convolutional Neural Networks (CNN) to jointly learn image representations and binary codes. However, because the non-smooth binary constraint causes ill-defined gradients in back-propagation, these methods resort to relaxation or approximation. As a result, even though they achieve highly discriminative image representations, these methods can only produce suboptimal binary codes. In this paper, we show that by directly handling the binary constraint, our methods obtain much better binary codes and hence higher retrieval performance.
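To make the relaxation issue concrete, the short sketch below contrasts the exact sign quantizer with a generic smooth surrogate (a scaled tanh); the surrogate and the slope parameter beta are illustrative assumptions, not the specific approximations used in [26, 27].

```python
import numpy as np

def exact_binary(z):
    return np.sign(z)                 # non-smooth: zero gradient almost everywhere

def relaxed_binary(z, beta=5.0):
    return np.tanh(beta * z)          # smooth surrogate usable in back-propagation

z = np.linspace(-1.0, 1.0, 5)
print(exact_binary(z))                # exact codes in {-1, 0, +1}
print(relaxed_binary(z))              # continuous values in (-1, 1), only approximately binary
```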

I-B Contributions

In this work, to address the problem of learning to preserve data affinity in low-dimensional binary codes, (i) we first propose a novel loss function to learn a single linear transformation under the column orthonormal constraint (please refer to Section I-C for our term definitions) in an unsupervised manner that compresses and binarizes the input data jointly. The approach is named Simultaneous Compression and Quantization (SCQ). Note that the idea of jointly compressing and binarizing data has been explored in [13, 14]. However, due to the difficulty of the non-convex orthogonal constraint, these works relax the orthogonal constraint and resort to the reconstruction criterion as an indirect way to handle the similarity-preserving concern. Our work is the first to tackle the similarity concern by enforcing strict orthogonal constraints.

(ii) Under the strict orthogonal constraints, we conduct analysis and experiments to show that our formulation is able to retain a high amount of variation and achieve a small quantization loss, which are important requirements in hashing for image retrieval [2, 13, 14]. As a result, this leads to improved accuracy, as demonstrated in our experiments.

(iii) We then propose to relax the column orthonormal constraint to a column orthogonal constraint on the transformation matrix. The relaxation not only yields additional retrieval performance gains but also significantly reduces the training time.

(iv) Our proposed loss functions, with the column orthonormal and orthogonal constraints, are confronted with two main challenges. The first is the binary constraint, which is the traditional and well-known difficulty of the hashing problem [1, 2, 3]. The second is the non-convex nature of the orthonormal/orthogonal constraint [28]. To tackle the binary constraint, we propose to apply alternating optimization with an auxiliary variable. Additionally, we resolve the orthonormal/orthogonal constraint by using the cyclic coordinate descent approach to learn one column of the projection matrix at a time while fixing the others. The proposed algorithms are named Orthonormal Encoder (OnE) and Orthogonal Encoder (OgE).

(v) Comprehensive experiments on common benchmark datasets show considerable improvements in the retrieval performance of the proposed methods over other state-of-the-art hashing methods. Additionally, the computational complexity and the training / online-processing time are also discussed to show the computational efficiency of our methods.

I-C Notations and Term definitions

We first introduce the notation. Given a zero-centered dataset X ∈ R^{n×D} which consists of n images, each represented by a D-dimensional feature descriptor, our proposed hashing methods aim to learn a column orthonormal/orthogonal matrix V ∈ R^{D×L} which simultaneously compresses the input data X to an L-dimensional space, while retaining a high amount of variation, and quantizes it to binary codes B ∈ {−1, +1}^{n×L}.

It is important to note that, in this work, we slightly abuse the terms column orthonormal/orthogonal matrix. Specifically, the term column orthonormal matrix indicates a matrix V such that V^⊤V = I_L, where I_L is the identity matrix, while the term column orthogonal matrix indicates a matrix V such that V^⊤V is an arbitrary diagonal matrix. Note that the word "column" means that the columns of the matrix are pairwise independent.

We define {λ_1, λ_2, …, λ_D} as the eigenvalues of the covariance matrix X^⊤X, sorted in descending order. Finally, let v_k and b_k be the k-th columns of V and B respectively.

The remainder of the paper is organized as follows. Section II presents in detail our proposed hashing method Orthonormal Encoder (OnE) and provides an analysis showing that our method can retain a high amount of variation and achieve a small quantization loss. Section III presents a relaxed version of OnE, i.e., Orthogonal Encoder (OgE). Section IV presents experimental results to validate the effectiveness of our proposed methods. We conclude the paper in Section V.

II Simultaneous Compression & Quantization: Orthonormal Encoder

II-A Problem Formulation

In order to jointly learn the data dimension reduction and the binary quantization using a single linear transformation V, we propose to solve the following constrained optimization:

min_{B, V} Q(B, V) = ‖B − XV‖_F²   s.t.  V^⊤V = I_L,  B ∈ {−1, +1}^{n×L}     (1)

where ‖·‖_F denotes the Frobenius norm. Additionally, the orthonormal constraint on the columns of V is necessary to make sure no redundant information is captured in the binary codes [29] and that the projection vectors do not scale up/down the projected data.

It is noteworthy to highlight the differences between our loss function Eq. (1) and the binary quantization loss function of ITQ [2]. Firstly, different from ITQ, which works on the compressed low-dimensional feature space obtained by PCA, our approach works directly on the original high-dimensional feature space X. This leads to the second main difference: the non-square column orthonormal matrix V simultaneously (i) compresses the data to low dimension and (ii) quantizes it to binary codes. However, it is important to note that solving for a non-square projection matrix is challenging. To handle this difficulty, ITQ proposes to solve the data compression and binary quantization problems in two separate optimizations. Specifically, it applies PCA to compress the data to L dimensions, and then uses the Orthogonal Procrustes approach [30] to learn a square rotation matrix that minimizes the binary quantization loss. However, a limitation of the ITQ approach is that the binary constraint is not considered in the data compression step, i.e., PCA; consequently, the solution is suboptimal. In this paper, by adopting recent advances in the cyclic coordinate descent approach [12, 14, 31, 32], we propose a novel and efficient algorithm that resolves the ITQ limitation by simultaneously attacking both problems in a single optimization under the strict orthogonal constraint. Hence, our optimization can lead to a better solution.

II-B Optimization

In this section, we discuss the key details of the algorithm (Algorithm 1) for solving the optimization problem in Eq. (1). In order to handle the binary constraint in Eq. (1), we propose to use alternating optimization over V and B.

Input:

X: training data;
L: code length;
max_iter: maximum iteration number;
ε, ε_λ: convergence error-tolerances;

Output:

Column orthonormal matrix V.

1: Randomly initialize V such that V^⊤V = I_L.
2: for t = 1 → max_iter do
3:     procedure Fix V, update B.
4:         Compute B = sgn(XV) (Eq. (4)).
5:     procedure Fix B, update V.
6:         Find λ_1 using binary search (BS) (Eq. (8)).
7:         Compute v_1 (Eq. (7)).
8:         for k = 2 → L do
9:             procedure Solve v_k
10:                 Initialize μ_k = 0.
11:                 while true do
12:                     Fix μ_k, solve for λ_k using BS.
13:                     Fix λ_k, compute μ_k.
14:                     Compute v_k (Eq. (12)).
15:                     if converged (tolerance ε_λ) then
16:                         return v_k
17:     if t > 1 and the relative reduction of the loss is less than ε then break
18: return V
Algorithm 1 Orthonormal Encoder

II-B1 Fix V and update B

When V is fixed, the problem becomes exactly the same as fixing the rotation matrix in ITQ [2]. To make the paper self-contained, we repeat the explanation of [2]. By expanding the objective function in Eq. (1), we have

‖B − XV‖_F² = ‖B‖_F² − 2 tr(B V^⊤ X^⊤) + ‖XV‖_F²     (2)

where ‖B‖_F² = nL is a constant. Because V is fixed, ‖XV‖_F² is also fixed; hence, minimizing (2) is equivalent to maximizing

tr(B V^⊤ X^⊤) = Σ_{i=1}^{n} Σ_{j=1}^{L} B_{ij} Z_{ij}     (3)

where Z = XV, and B_{ij} and Z_{ij} denote the elements of B and Z respectively. To maximize this expression with respect to B, we need B_{ij} = +1 whenever Z_{ij} ≥ 0 and B_{ij} = −1 otherwise. Hence, the optimal value of B is simply achieved by

B = sgn(XV)     (4)
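Assuming the reconstructed Eq. (4), the B-step is a single element-wise sign of the projected data; a one-line numpy sketch with our own variable names:

```python
import numpy as np

def update_B(X, V):
    """Fix V, update B (Eq. (4)): element-wise sign of the projected data XV."""
    return np.sign(X @ V)
```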

II-B2 Fix B and update V

When B is fixed, the optimization is no longer a mixed-integer problem. However, the problem is still non-convex and difficult to solve due to the orthonormal constraint [28]. It is important to note that V is not a square matrix; hence, the objective function is not the classic Orthogonal Procrustes problem [30], and we cannot obtain a closed-form solution for V as proposed in [2]. To the best of our knowledge, there is no easy way to achieve a closed-form solution for a non-square V. Hence, to overcome this challenge, inspired by PCA and the recent cyclic coordinate descent methods [12, 14, 31, 32], we iteratively learn one vector, i.e., one column of V, at a time. We now consider the two cases k = 1 and 2 ≤ k ≤ L.

  • 1-st vector v_1

min_{v_1} ‖b_1 − X v_1‖_2²   s.t. ‖v_1‖_2² = 1     (5)

where ‖·‖_2 is the ℓ2-norm.

Let λ_1 be the Lagrange multiplier; we formulate the Lagrangian L_1:

L_1(v_1, λ_1) = ‖b_1 − X v_1‖_2² + λ_1(‖v_1‖_2² − 1)     (6)

By minimizing L_1 over v_1, we can achieve:

v_1 = (X^⊤X + λ_1 I)^{−1} X^⊤ b_1     (7)

given that λ_1 maximizes the dual function g_1(λ_1) of L_1(v_1, λ_1) [33] (the dual function can be simply constructed by substituting v_1 from Eq. (7) into Eq. (6)). Equivalently, λ_1 should satisfy the following conditions:

λ_1 > −λ_D   and   ‖(X^⊤X + λ_1 I)^{−1} X^⊤ b_1‖_2² = 1     (8)

where λ_D is the smallest eigenvalue of X^⊤X.

In Eq. (8), the first condition makes sure that (X^⊤X + λ_1 I) is non-singular, and the second condition is achieved by setting the derivative of g_1(λ_1) with respect to λ_1 equal to 0.

The second equation in Eq. (8) can be recognized as a 2D-order polynomial equation in λ_1, which has no explicit closed-form solution in general. Fortunately, since g_1(λ_1) is a concave function of λ_1, its derivative is monotonically decreasing. Hence, we can simply solve for λ_1 using binary search with a small error-tolerance ε_λ (a numerical sketch of this search is given at the end of this subsection). Note that:

‖v_1‖_2 → ∞ as λ_1 → (−λ_D)^+   and   ‖v_1‖_2 → 0 as λ_1 → +∞     (9)

thus Eq. (8) always has a solution.

  • k-th vector v_k (2 ≤ k ≤ L)

For the second vector onward, besides the unit-norm constraint, we also need to make sure that the current vector is pairwise independent of its previous vectors:

min_{v_k} ‖b_k − X v_k‖_2²   s.t. ‖v_k‖_2² = 1,  V_{k−1}^⊤ v_k = 0     (10)

where V_{k−1} = [v_1, …, v_{k−1}]. Let λ_k and μ_k = [μ_{k,1}, …, μ_{k,k−1}]^⊤ be the Lagrange multipliers; we also formulate the Lagrangian L_k:

L_k(v_k, λ_k, μ_k) = ‖b_k − X v_k‖_2² + λ_k(‖v_k‖_2² − 1) + v_k^⊤ V_{k−1} μ_k     (11)

Minimizing L_k over v_k, similar to Eq. (7), we can achieve:

v_k = (X^⊤X + λ_k I)^{−1} (X^⊤ b_k − V_{k−1} μ_k / 2)     (12)

given that (λ_k, μ_k) satisfy the following conditions, which make the corresponding dual function maximum:

λ_k > −λ_D,   ‖v_k‖_2² = 1,   V_{k−1}^⊤ v_k = 0     (13)

where, for a fixed λ_k,

μ_k = 2 (V_{k−1}^⊤ A_k^{−1} V_{k−1})^{−1} V_{k−1}^⊤ A_k^{−1} X^⊤ b_k     (14)

in which A_k = X^⊤X + λ_k I.

There is no straightforward solution for (λ_k, μ_k). In order to resolve this difficulty, we propose to use alternating optimization to solve for λ_k and μ_k. In particular, (i) given a fixed μ_k (initialized as 0), we find λ_k using binary search as discussed above; additionally, similar to λ_1, there is always a solution for λ_k. Then, (ii) with λ_k fixed, we can get the closed-form solution for μ_k as in Eq. (14). Note that since the dual function is a concave function of (λ_k, μ_k), alternately optimizing λ_k and μ_k still guarantees that the solution approaches the global optimum.
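A possible numpy sketch of the binary search used above for λ_1 (with μ_k fixed, the same search applies to λ_k), assuming the forms of Eqs. (7)-(9) reconstructed here; the bracket-expansion strategy and tolerance name are our own choices.

```python
import numpy as np

def solve_v1(X, b1, eps=1e-6):
    """Solve Eq. (5): find lambda_1 by binary search so that ||v_1||_2 = 1
    (Eq. (8)), then return v_1 = (X^T X + lambda_1 I)^{-1} X^T b_1 (Eq. (7))."""
    C = X.T @ X
    d = X.T @ b1
    lam_min = np.linalg.eigvalsh(C)[0]              # smallest eigenvalue of X^T X
    I = np.eye(C.shape[0])

    def v_norm(lam):
        return np.linalg.norm(np.linalg.solve(C + lam * I, d))

    # ||v_1(lambda)|| is monotonically decreasing on (-lam_min, +inf):
    # it blows up near -lam_min and vanishes as lambda -> +inf (Eq. (9)).
    lo = -lam_min + 1e-8
    hi = max(1.0, lo + 1.0)
    while v_norm(hi) > 1.0:                         # expand the bracket until the norm drops below 1
        hi += 2.0 * (hi - lo)
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        if v_norm(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    lam1 = 0.5 * (lo + hi)
    return np.linalg.solve(C + lam1 * I, d), lam1
```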

Figure 1 shows an error convergence curve for the optimization problem in Eq. (1). We stop the optimization when the relative reduction of the quantization loss between two consecutive iterations is less than the error-tolerance ε.

Fig. 1: Quantization error when learning the projection matrix V on the CIFAR-10 dataset (Section IV-A).

II-C Retained variation and quantization loss

In the hashing problem for image retrieval, both the retained variation and the quantization loss are important. In this section, we provide analysis to show that, when solving Eq. (1), it is possible to retain a high amount of variation and achieve a small quantization loss. As will be discussed in more detail, this can be accomplished by applying an appropriate scale S to the input dataset. Noticeably, applying any positive scale S (for simplicity, we only discuss positive values of S; negative values should have similar effects) to the dataset strictly preserves the local structure of the data, i.e., the ranked nearest-neighbor set of every data point is always the same. Therefore, in the hashing problem for the retrieval task, it is equivalent to work on a scaled version of the dataset, i.e., SX. We can rewrite the loss function of Eq. (1) as follows:

Q = ‖B − SXV‖_F² = ‖1 − S · abs(XV)‖_F²   (with B = sgn(XV))     (15)

where abs(·) is the element-wise absolute-value operation and 1 is the all-1 matrix. In what follows, we discuss how S affects the retained variation and the quantization loss.

II-C1 Maximizing retained variation

We recognize that by scaling the dataset by an appropriate scale S such that all projected data points lie inside the hyper-cube [−1, +1]^L, i.e., S · abs(XV) ≤ 1 element-wise, the maximizing-retained-variation problem (PCA) can achieve results similar to the minimizing-quantization-loss problem, i.e., Eq. (15). Intuitively, we can interpret the former problem, i.e., PCA, as finding the projection that maximizes the distances of the projected data points from the coordinate origin, while the latter problem, i.e., minimizing the binary quantization loss, tries to find the projection matrix that minimizes the distances of the projected data points from −1 or +1 correspondingly. A simple 1-D illustration of the relationship between the two problems is given in Figure 2.

Since each column of V is constrained to have unit norm, the condition S · abs(XV) ≤ 1 can be satisfied by scaling the dataset by S_1 = 1/m, which puts all data points in the original space inside the hyper-ball with unit radius, where m is the largest ℓ2-distance between a data point and the coordinate origin.
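A quick numerical check of this claim, using names of our own choosing: scaling by S_1 = 1/m puts every point inside the unit ball, so by Cauchy-Schwarz any unit-norm projection vector keeps every projected value within [−1, +1].

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
X -= X.mean(axis=0)                          # zero-center the data

m = np.linalg.norm(X, axis=1).max()          # largest l2-distance to the origin
S1 = 1.0 / m
Xs = S1 * X                                  # all points now lie inside the unit hyper-ball

v = rng.normal(size=64)
v /= np.linalg.norm(v)                       # an arbitrary unit-norm projection vector
assert np.abs(Xs @ v).max() <= 1.0 + 1e-12   # |v^T x| <= ||v||_2 ||x||_2 <= 1
```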

Fig. 2: An illustration of the relationship between the minimizing quantization loss and maximizing retained variation problems.

II-C2 Minimizing quantization loss

Fig. 3 (a)-(c): A toy example illustrating how the quantization loss and the minimizing-quantization-loss vector (green dashed line) vary as S increases. The values in the legends present the variances and the quantization losses per bit of the data projected onto the corresponding vectors (rounded to two decimal places).

Regarding the quantization loss Q (Eq. (15)), which is a convex function of S · abs(XV), setting its derivative to zero gives the condition for the global minimum Q = 0:

1 − S · abs(XV) = 0     (16)

where 0 is the all-0 matrix.

Considering Eq. (16), there are two important findings. Firstly, there is obviously no scaling value S that can concurrently achieve maximum retained variation and Q = 0, except the unrealistic case in which all projected values already have equal magnitude. Secondly, from Eq. (16), we can recognize that as S gets larger, i.e., 1/S gets smaller, minimizing the loss will produce a V that focuses on lower-variance directions so as to achieve smaller abs(XV); it means that S · abs(XV) gets closer to the global minimum of Q. Consequently, the quantization loss becomes smaller. In Figure 3, we show a toy example to illustrate that as S increases, minimizing the quantization loss diverts the projection vector from the top PCA component (Figure 3(a)) to smaller-variance directions (Figures 3(b), 3(c)), while the quantization loss (per bit) gets smaller (Figures 3(a)-3(c)). In summary, as 1/S gets smaller, the quantization loss is smaller, and vice versa. However, note that continuing to increase S when V already focuses on the least-variance directions will make the quantization loss larger.

Note that the scale S is a hyper-parameter in our system. In the experiment section (Section IV-B), we additionally conduct experiments to quantitatively analyze the effect of this scale hyper-parameter and determine proper values using a validation dataset.

III Simultaneous Compression & Quantization: Orthogonal Encoder

III-A Problem Re-formulation: Orthonormal to Orthogonal

In Orthonormal Encoder (OnE), we work with the column orthonormal constraint on V. However, we recognize that relaxing this constraint to a column orthogonal constraint, i.e., relaxing the unit-norm constraint on each column of V by converting it into a penalty term, provides three important advantages. The new loss function is as follows:

min_{B, V} ‖B − XV‖_F² + α ‖V‖_F²   s.t.  v_i^⊤ v_j = 0, ∀ i ≠ j,  B ∈ {−1, +1}^{n×L}     (17)

where α is a fixed positive hyper-parameter that penalizes large norms of the columns of V. It is important to note that, in Eq. (17), we still enforce the strict pairwise-independence constraint on the projection vectors to ensure no redundant information is captured.

Firstly, with an appropriately large α, the optimization prefers to choose high-variance components of X, since this helps to achieve projection vectors with smaller norms. In other words, without penalizing large norms of the columns of V, the optimization has no incentive to focus on high-variance components, since it can produce projection vectors with arbitrarily large norms that scale any component appropriately to achieve the minimum binary quantization loss. Secondly, this provides more flexibility by allowing different scale values for different directions. Consequently, relaxing the unit-norm constraint on each column of V helps to mitigate the difficulty of choosing the scale value S. However, it is important to note that a too-large α, on the other hand, may distract the optimization from minimizing the binary quantization term. Finally, from the OnE optimization (Section II-B), we observed that the unit-norm constraint on each column of V makes the OnE optimization difficult to solve efficiently, since there is no closed-form solution for each column. By relaxing this unit-norm constraint, we can now achieve closed-form solutions for the columns of V; hence, it is very computationally beneficial. We discuss the computational aspect further in Section III-C.

III-B Optimization

Similar to Algorithm 1 for solving the Orthonormal Encoder, we alternately optimize B and V, with the B step exactly the same as Eq. (4). For the V step, we also utilize the cyclic coordinate descent approach to iteratively solve for V, i.e., column by column. The loss functions are rewritten and their corresponding closed-form solutions for each column can be efficiently obtained as follows:

  • 1-st vector v_1

min_{v_1} ‖b_1 − X v_1‖_2² + α ‖v_1‖_2²     (18)

We can see that Eq. (18) is a regularized least squares (ridge regression) problem, whose closed-form solution is given by:

v_1 = (X^⊤X + αI)^{−1} X^⊤ b_1     (19)
  • k-th vector v_k (2 ≤ k ≤ L)

min_{v_k} ‖b_k − X v_k‖_2² + α ‖v_k‖_2²   s.t. V_{k−1}^⊤ v_k = 0     (20)

Given the Lagrange multiplier μ_k, similar to Eq. (7) and Eq. (11), we can obtain v_k as follows:

v_k = A^{−1} (X^⊤ b_k − V_{k−1} μ_k / 2)     (21)

where μ_k = 2 (V_{k−1}^⊤ A^{−1} V_{k−1})^{−1} V_{k−1}^⊤ A^{−1} X^⊤ b_k, in which

A = X^⊤X + αI     (22)

and V_{k−1} = [v_1, …, v_{k−1}].

Note that, given a fixed α, A = X^⊤X + αI is a constant matrix, and the matrix V_{k−1}^⊤ A^{−1} V_{k−1} contains the matrix V_{k−2}^⊤ A^{−1} V_{k−2} in its top-left corner. This means that only the last row and column of V_{k−1}^⊤ A^{−1} V_{k−1} need to be computed at step k. Thus, v_k can be solved even more efficiently.

Finally, similar to OnE (Fig. 1), we also empirically observe the convergence of the optimization problem in Eq. (17). We summarize the Orthogonal Encoder method in Algorithm 2.

Input:

X: training data;
L: code length;
max_iter: maximum iteration number;
ε: convergence error-tolerance;

Output:

Column orthogonal matrix V.

1: Randomly initialize V such that v_i^⊤ v_j = 0, ∀ i ≠ j.
2: for t = 1 → max_iter do
3:     Fix V, update B: Compute B = sgn(XV) (Eq. (4)).
4:     Fix B, update V: Compute v_1, …, v_L (Eq. (19), (21)).
5:     if t > 1 and the relative reduction of the loss is less than ε then break
6: return V
Algorithm 2 Orthogonal Encoder
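Putting the pieces together, here is a compact numpy sketch of OgE under the formulation reconstructed above: the B-step of Eq. (4), the ridge solution of Eq. (19) for the first column, and an equality-constrained ridge solve (a KKT system equivalent to Eqs. (21)-(22)) for the remaining columns. The default alpha, the initialization, and the fixed iteration count are our simplifications; the scale pre-processing of Section II-C is assumed to have been applied to X beforehand.

```python
import numpy as np

def oge_train(X, L, alpha=1.0, n_iter=50):
    """Orthogonal Encoder (OgE) sketch: alternate B = sgn(XV) with
    column-by-column regularized least-squares updates of V under
    the pairwise-orthogonality constraint v_i^T v_j = 0 (i != j)."""
    n, D = X.shape
    A = X.T @ X + alpha * np.eye(D)               # constant for a fixed alpha (Eq. (22))
    V = np.linalg.qr(np.random.randn(D, L))[0]    # orthogonal initialization
    for _ in range(n_iter):
        B = np.sign(X @ V)                        # fix V, update B (Eq. (4))
        for k in range(L):                        # fix B, update V column by column
            rhs = X.T @ B[:, k]
            if k == 0:
                V[:, 0] = np.linalg.solve(A, rhs)          # ridge solution (Eq. (19))
            else:
                Vk = V[:, :k]                              # previously solved columns
                # KKT system of: min ||b_k - X v||^2 + alpha ||v||^2  s.t.  Vk^T v = 0
                kkt = np.block([[A, Vk],
                                [Vk.T, np.zeros((k, k))]])
                sol = np.linalg.solve(kkt, np.concatenate([rhs, np.zeros(k)]))
                V[:, k] = sol[:D]
    return V, np.sign(X @ V)

# hypothetical usage: V, B = oge_train(S * X_centered, L=32, alpha=1.0)
```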

III-C Complexity analysis

The complexity of the two algorithms, OnE and OgE, is shown in Table I. In our empirical experiments, the number of alternating iterations T is usually around 50, the number of iterations for solving (λ_k, μ_k) is at most 10, and n ≫ D > L for the CNN fully-connected features (Section IV-A). Firstly, we can observe that OgE is very efficient, as its complexity depends only linearly on the number of training samples n, the feature dimension D, and the code length L. In addition, OgE is also faster than OnE. Furthermore, as our methods aim to learn projection matrices that preserve high-variance components, it is unnecessary to work on very high-dimensional features, since the many low-variance/noisy components will be discarded eventually. In practice, we observe no performance drop when applying PCA to compress the features to much lower dimensions, e.g., 512-D. This helps to achieve a significant speed-up in training time for both algorithms, especially for OnE, whose time complexity grows quickly with D when D is large. In addition, we conduct experiments to measure the actual running time of the algorithms and compare with other methods in Section IV-D.

Computational complexity
OnE
OgE
TABLE I: Computational complexity of algorithms OnE and OgE, where n is the number of training samples, D is the feature dimension, T is the number of iterations to alternately update B and V, and T_BS is the number of iterations for solving (λ_k, μ_k) in Algorithm 1.

IV Experiments

IV-A Datasets, Evaluation protocols, and Implementation notes

Fig. 4 (a) CIFAR-10, (b) LabelMe-12-50k, (c) SUN397: Analyzing the effects of the scale value S on (i) the quantization loss per bit (blue dashed line, right Y-axis), (ii) the percentage of total variation retained by the minimizing-quantization-loss projection matrix in comparison with the total variation retained by the top-L PCA components (red line, right Y-axis), and (iii) the retrieval performance in mAP (green line, left Y-axis). Note that the X-axis is in descending order.
Dataset CIFAR-10 [34] LabelMe-12-50k [35] SUN397 [36]
L 8 16 24 32 8 16 24 32 8 16 24 32
mAP SpH [16] 17.09 18.77 20.19 20.96 11.68 13.24 14.39 14.97 9.13 13.53 16.63 19.07
KMH [15] 22.22 24.17 24.71 24.99 16.09 16.18 16.99 17.24 21.91 26.42 28.99 31.87
BA [13] 23.24 24.02 24.77 25.92 17.48 17.10 17.91 18.07 20.73 31.18 35.36 36.40
ITQ [2] 24.75 26.47 26.86 27.19 17.56 17.73 18.52 19.09 20.16 30.95 35.92 37.84
SCQ - OnE 27.08 29.64 30.57 30.82 19.76 21.96 23.61 24.25 23.37 34.09 38.13 40.54
SCQ - OgE 26.98 29.33 30.65 31.15 20.63 23.07 23.54 24.68 23.44 34.73 39.47 41.82
prec@r2 SpH 18.04 30.58 37.28 21.40 11.72 19.38 25.14 13.66 6.88 23.68 37.21 27.39
KMH 21.97 36.64 42.33 27.46 15.20 26.17 32.09 18.62 9.50 36.14 51.27 39.29
BA 23.67 38.05 42.95 23.49 16.22 25.75 31.35 13.14 10.50 37.75 50.38 41.11
ITQ 24.38 38.41 42.96 28.63 15.86 25.46 31.43 17.66 9.78 35.15 49.85 46.34
SCQ - OnE 24.48 36.49 41.53 43.90 16.69 27.30 34.63 33.04 8.68 30.12 43.54 50.41
SCQ - OgE 24.35 38.30 43.01 44.01 16.57 27.80 34.77 34.64 8.76 29.31 45.03 51.88
prec@1k SpH 22.93 26.99 29.50 31.98 14.07 16.78 18.52 19.27 10.79 15.36 18.21 20.07
KMH 32.30 33.65 35.52 37.77 21.07 20.97 21.41 21.98 18.94 24.93 25.74 28.26
BA 31.73 34.16 35.67 37.01 21.14 21.71 22.64 22.83 19.22 28.68 31.31 31.80
ITQ 32.40 36.35 37.25 37.96 21.01 22.00 22.98 23.63 18.86 28.62 31.56 32.74
SCQ - OnE 33.38 37.82 39.13 40.40 22.91 25.39 26.55 27.16 19.26 29.95 32.72 34.08
SCQ - OgE 33.41 38.33 39.54 40.70 23.94 25.94 26.99 27.46 20.10 29.95 33.43 35.00
TABLE II: Performance comparison with the state-of-the-art unsupervised hashing methods.
The Bold and Underline values indicate the best and second best performances respectively.

The CIFAR-10 dataset [34] contains 60,000 fully-annotated 32×32 color images from 10 object classes (6,000 images per class). The provided test set (1,000 images per class) is used as the query set. The remaining 50,000 images are used as the training set and database.

The LabelMe-12-50k dataset [35] consists of 50,000 fully annotated 256×256 color images of 12 object classes, which is a subset of the LabelMe dataset [37]. In this dataset, for an image having multiple label values, the object class with the largest label value is chosen as the image label. We also use the provided test set as the query set and the remaining images as the training set and database.

The SUN397 dataset [36] contains about 108,000 fully annotated color images from 397 scene categories. We select the subset of categories that contain more than 500 images each to construct our dataset. We then randomly sample 100 images per class to form the query set. The remaining images are used as the training set and database.

For the above image datasets, each image is represented by a 4096-D feature vector extracted from the fully-connected layer 7 (FC7) of the pre-trained VGG network [38].

Evaluation protocols. As the datasets are fully annotated, we use semantic labels to define the ground truth of image queries. We apply three standard evaluation metrics, which are widely used in the literature [13, 39, 2], to measure the retrieval performance of all methods: 1) mean Average Precision (mAP); 2) precision at Hamming radius of 2 (prec@r2), which measures the precision of retrieved images having Hamming distance at most 2 from the query (we report zero precision for queries that return no image); 3) precision at top 1000 returned images (prec@1k), which measures the precision of the top 1000 retrieved images.
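The three metrics can be computed directly from Hamming distances; below is a minimal, unoptimized numpy sketch assuming single-label ground truth and precomputed {-1,+1} codes (function and variable names are ours, and ties in the ranking are broken arbitrarily).

```python
import numpy as np

def hamming(Bq, Bd):
    """Pairwise Hamming distances between {-1,+1} code matrices."""
    L = Bq.shape[1]
    return (L - Bq @ Bd.T) // 2

def retrieval_metrics(Bq, Bd, yq, yd, top_k=1000, radius=2):
    """Return (mAP, prec@r2, prec@1k) averaged over all queries."""
    D = hamming(Bq, Bd)
    maps, prec_r2, prec_k = [], [], []
    for i in range(Bq.shape[0]):
        order = np.argsort(D[i], kind="stable")
        rel = (yd[order] == yq[i]).astype(float)
        hits = np.cumsum(rel)
        ranks = np.arange(1, len(rel) + 1)
        maps.append((rel * hits / ranks).sum() / max(rel.sum(), 1.0))   # average precision
        in_ball = D[i] <= radius
        prec_r2.append((yd[in_ball] == yq[i]).mean() if in_ball.any() else 0.0)
        prec_k.append(rel[:top_k].mean())                               # precision at top_k
    return np.mean(maps), np.mean(prec_r2), np.mean(prec_k)
```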

Implementation notes. As discussed in Section III-C, for computational efficiency, we apply PCA to reduce the feature dimension to 512-D for our proposed methods. The hyper-parameter α of the OgE algorithm is set empirically and kept fixed for all experiments. Finally, for both OnE and OgE, we set all error-tolerance values ε to the same small constant and fix the maximum number of iterations. The implementation of our methods is available at https://github.com/hnanhtuan/SCQ.git.

For all compared methods, i.e., Spherical Hashing (SpH) [16], K-means Hashing (KMH) [15], Binary Autoencoder (BA) [13], and Iterative Quantization (ITQ) [2], we use the implementations with the suggested parameters provided by the authors. Due to the very long training time of KMH [15] at high dimensions, we apply PCA to reduce the dimension from 4096-D to 512-D for KMH; additionally, we execute experiments for KMH with different settings and report the best results. Besides, to improve the statistical stability of the results, we report the average values of 5 executions.

IV-B Effects of parameters

As discussed in Section II-C, when S decreases, the projection matrix can be learned to retain a very high amount of variation, as much as PCA can; however, this causes an undesirably large binary quantization loss, and vice versa. In this section, we additionally provide quantitative analysis of the effects of the scale parameter S on these two factors and, moreover, on the retrieval performance.

In this experiment, for all datasets, i.e., CIFAR-10, LabelMe-12-50k, and SUN397, we randomly select 20 images per class from the training set (as discussed in Section IV-A) to form a validation set; the remaining images are used for training. To obtain each data point, we solve the problem in Eq. (1) at various scale values S and use the OnE algorithm (Algorithm 1, Section II-B) to tackle the optimization.

Figure 4 presents (i) the quantization loss per bit, (ii) the percentage of total variation retained by the minimizing-quantization-loss projection matrix in comparison with the total variation retained by the top-L PCA components as S varies, and (iii) the retrieval performance (mAP) on the validation sets. Firstly, we can observe that there is no scale S that simultaneously maximizes the retained variation and minimizes the optimal quantization loss. On the one hand, as the scale value S decreases, minimizing the loss function in Eq. (15) produces a projection matrix that focuses on high-variance directions, i.e., retains more variation in comparison with PCA (red line). On the other hand, at smaller S, the quantization loss is much larger (blue dashed line). The empirical results are consistent with our discussion in Section II-C.

Secondly, regarding the retrieval performance, unsurprisingly, the performance drops when the scale S gets too small, i.e., a high amount of variation is retained but the quantization loss is too large, or when S gets too large, i.e., the quantization loss is small but only low-variance components are retained. Hence, it is necessary to balance these two factors. As the data variation varies from dataset to dataset, the scale value should be determined from the dataset itself. In particular, we leverage the eigenvalues {λ_i}, which are the variances of the PCA components, to determine this hyper-parameter. From the experimental results in Figure 4, we propose to set the scale parameter as:

S = sqrt( L / Σ_{i=1}^{L} λ_i )     (23)

This setting generally achieves the best performance across multiple datasets, feature types, and hash code lengths, without resorting to multiple trainings and cross-validations. The proposed working points of the scale S are shown in Figure 4. We apply this scale parameter to the datasets for both the OnE and OgE algorithms in all later experiments.

Note that the numerator of the fraction in Eq. (23), i.e., L, is the hash code length, which is also the total variation of the binary codes B. In addition, the denominator is the total variation of the top-L PCA components, i.e., the maximum amount of variation that can be retained in an L-dimensional feature space. Hence, we can interpret the scale S as a factor that makes the amounts of variation, i.e., energy, of the input SX and the output (the binary codes B) comparable. This property is important: when the variation of the input is much larger than the variation of the output, there is obviously some information loss; on the other hand, when the variation of the output is larger than that of the input, the output contains undesirable additional information.
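Under the reconstructed form of Eq. (23), the scale can be computed directly from the eigenvalue spectrum of the training data; the sketch below assumes zero-centered data and a particular covariance convention (including or omitting the 1/n factor rescales S by a constant), so treat it as illustrative.

```python
import numpy as np

def scale_from_eigenvalues(X, L):
    """S = sqrt(L / sum of the top-L eigenvalues), per the reconstructed Eq. (23)."""
    cov = X.T @ X / X.shape[0]                     # covariance of the zero-centered data
    lam = np.sort(np.linalg.eigvalsh(cov))[::-1]   # eigenvalues in descending order
    return np.sqrt(L / lam[:L].sum())

# usage: Xs = scale_from_eigenvalues(X, L=32) * X, then solve Eq. (1) or Eq. (17) on Xs
```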

IV-C Comparison with state-of-the-art

In this section, we evaluate our proposed hashing methods, SCQ-OnE and SCQ-OgE, and compare them with the state-of-the-art unsupervised hashing methods SpH, KMH, BA, and ITQ. The experimental results in mAP, prec@r2, and prec@1k are reported in Table II. Our proposed methods clearly achieve significant improvements over all datasets for the majority of evaluation metrics. The improvement gaps are clearer at higher code lengths, i.e., L = 32. Additionally, OgE generally achieves slightly higher performance than OnE. Moreover, it is noticeable that, for prec@r2, all compared methods suffer performance degradation at long hash codes, e.g., L = 32, while our proposed methods still achieve good prec@r2 at L = 32. This shows that the binary codes produced by our methods preserve data similarity well.

Methods mAP prec@r2
16 32 16 32
CIFAR-10 DH [39] 16.17 16.62 23.33 15.77
UH-BDNN [14] 17.83 18.52 24.97 18.85
SCQ - OnE 17.97 18.63 24.57 23.72
SCQ - OgE 18.00 18.78 24.15 25.69
TABLE III: Performance comparison in mAP and prec@r2 with Deep Hashing (DH) and Unsupervised Hashing with Binary Deep Neural Network (UH-BDNN) on the CIFAR-10 dataset for L = 16 and 32. The Bold values indicate the best performances.

Comparison with Deep Hashing (DH) [39] and Unsupervised Hashing with Binary Deep Neural Network (UH-BDNN) [14]. Recently, several methods [39, 14] have applied DNNs to learn binary hash codes and achieved very competitive performance. Hence, in order to have a complete evaluation, following the experimental settings of [39, 14], we conduct experiments on the CIFAR-10 dataset. In this experiment, 100 images are randomly sampled for each class as the query set; the remaining images are used for training and as the database. Each image is represented by a 512-D GIST descriptor [40]. In addition, to avoid biased results due to test samples, we repeat the experiment 5 times with 5 different random training/query splits. The comparative results in terms of mAP and prec@r2 are presented in Table III. Our proposed methods are very competitive with DH and UH-BDNN, specifically achieving higher mAP and prec@r2 at L = 32 than both DH and UH-BDNN.

Methods CIFAR-10 NUS-WIDE
12 24 32 48 12 24 32 48
mAP BGAN [27] 40.1 51.2 53.1 55.8 67.5 69.0 71.4 72.8
SCQ - OnE 53.59 55.77 57.62 58.14 69.82 70.53 72.78 73.25
SCQ - OgE 53.83 55.65 57.74 58.44 70.17 71.31 72.49 72.95
TABLE IV: Performance comparison in mAP with BGAN on CIFAR-10 and NUS-WIDE datasets.

Comparison with Binary Generative Adversarial Networks for Image Retrieval (BGAN) [27]. Recently, BGAN has applied a continuous approximation of the sign function to learn binary codes that can help generate images plausibly similar to the original images. The method has been shown to achieve outstanding performance in the unsupervised image hashing task. We note that BGAN differs from our method and the compared methods in that BGAN jointly learns image feature representations and binary codes, in which the binary codes are obtained using a smooth approximation of the sign function, while ours and the compared methods learn the optimal binary codes given the image representations. Hence, to further validate the effectiveness of our methods and to compare with BGAN, we apply our methods to the FC7 features extracted from the feature-extraction component of the pre-trained BGAN model on the CIFAR-10 and NUS-WIDE [41] datasets. In this experiment, we aim to show that by applying our hashing methods to the pretrained features from the feature-extraction component of BGAN, our methods can produce better hash codes than the joint learning approach of BGAN. Similar to BGAN [27], for both CIFAR-10 and NUS-WIDE, we randomly select 100 images per class as the test query set; the remaining images are used as the database for retrieval. We then randomly sample from the database 1,000 images per class as the training set. Table IV shows that by using the more discriminative features from the pre-trained feature-extraction component of BGAN, our methods outperform BGAN, i.e., our methods produce better binary codes than the approximate sign function in BGAN, and achieve state-of-the-art performance in the unsupervised image hashing task.

IV-D Training time and Processing time

In this experiment, we empirically evaluate the training time and online processing time of our methods. The experiments are carried out on a workstation with a 4-core i7-6700 CPU @ 3.40GHz, on the combination of the CIFAR-10, LabelMe-12-50k, and SUN397 datasets. For OnE and OgE, the training time includes the time for applying zero-mean centering, scaling, and reducing the dimension to 512-D. We use 50 iterations for all experiments. Figure 5 shows that our proposed methods, OnE and OgE, are very efficient. OgE is just slightly slower than ITQ [2]. Even though OnE is slower than OgE and ITQ, it takes just over a minute for 100,000 training samples, which is still very fast and practical in comparison with the several dozen minutes required by KMH [15], BA [13], and UH-BDNN [14] (for training 50,000 CIFAR-10 samples using the authors' released code and dataset [14]).

Compared with training cost, the time to produce new hash codes is more important since it is done in real time. Similar to Semi-Supervised Hashing (SSH) [29] and ITQ [2], by using only a single linear transformation, our proposed methods require only one BLAS operation (gemv or gemm) and a comparison operation; hence, it takes negligible time to produce binary codes for new data points.
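As a final illustration, the online encoding step really is one projection plus a comparison; a hedged sketch in which the stored mean, scale, and V are assumed to come from training:

```python
import numpy as np

def encode(X_new, mean, S, V):
    """Online hashing: one gemv/gemm (projection) plus an element-wise comparison."""
    Z = (S * (X_new - mean)) @ V       # single BLAS call: gemv for one sample, gemm for a batch
    return (Z >= 0).astype(np.uint8)   # 0/1 bits; use np.sign(Z) for the {-1,+1} convention
```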

Fig. 5: The training time for learning 32-bit hash code embedding.

V Conclusion

In this paper, we addressed the problem of jointly learning to preserve pairwise (dis)similarity of data in a low-dimensional space and to minimize the binary quantization loss under strict orthonormal/orthogonal constraints. Additionally, we showed that as more variation is retained, the quantization loss becomes undesirably large, and vice versa; hence, by appropriately balancing these two factors with a scale, our methods can produce better binary codes. Extensive experiments on various datasets show that our proposed methods, Simultaneous Compression and Quantization (SCQ): Orthonormal Encoder (OnE) and Orthogonal Encoder (OgE), outperform other state-of-the-art hashing methods by clear margins under various standard evaluation metrics and benchmark datasets. Furthermore, OnE and OgE are very computationally efficient in both the training and testing stages.

References

  • [1] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” Commun. ACM, vol. 51, Jan. 2008.
  • [2] Y. Gong and S. Lazebnik, “Iterative quantization: A procrustean approach to learning binary codes,” in CVPR, 2011.
  • [3] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in NIPS, 2009.
  • [4] D. Zhang, J. Wang, D. Cai, and J. Lu, “Self-taught hashing for fast similarity search,” in ACM SIGIR, 2010.
  • [5] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in VLDB, 1999.
  • [6] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in ICCV, Nov 2009.
  • [7] M. Raginsky and S. Lazebnik, “Locality-sensitive binary codes from shift-invariant kernels,” in NIPS, 2009.
  • [8] B. Kulis and T. Darrell, “Learning to hash with binary reconstructive embeddings,” in NIPS, 2009.
  • [9]

    G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter, “Fast supervised hashing with decision trees for high-dimensional data,” in

    CVPR, 2014.
  • [10] W. Liu, J. Wang, R. Ji, Y. G. Jiang, and S. F. Chang, “Supervised hashing with kernels,” in CVPR, 2012.
  • [11] M. Norouzi, D. J. Fleet, and R. Salakhutdinov, “Hamming distance metric learning,” in NIPS, 2012.
  • [12] F. Shen, C. Shen, W. Liu, and H. T. Shen, “Supervised discrete hashing,” in CVPR, 2015.
  • [13] M. Á. Carreira-Perpiñán and R. Raziperchikolaei, “Hashing with binary autoencoders,” in CVPR, 2015.
  • [14] T.-T. Do, A.-D. Doan, and N.-M. Cheung, “Learning to hash with binary deep neural network,” in ECCV, 2016.
  • [15] K. He, F. Wen, and J. Sun, “K-means hashing: An affinity-preserving quantization method for learning binary compact codes,” in CVPR, 2013.
  • [16] J. P. Heo, Y. Lee, J. He, S. F. Chang, and S. E. Yoon, “Spherical hashing,” in CVPR, 2012.
  • [17] F. Shen, Y. Xu, L. Liu, Y. Yang, Z. Huang, and H. T. Shen, “Unsupervised deep hashing with similarity-adaptive and discrete optimization,” IEEE TPAMI, pp. 1–1, 2018.
  • [18] M. Hu, Y. Yang, F. Shen, N. Xie, and H. T. Shen, “Hashing with angular reconstructive embeddings,” IEEE TIP, vol. 27, no. 2, pp. 545–555, Feb 2018.
  • [19] Y. Huang and Z. Lin, “Binary multidimensional scaling for hashing,” IEEE TIP, vol. 27, no. 1, pp. 406–418, Jan 2018.
  • [20] L. y. Duan, Y. Wu, Y. Huang, Z. Wang, J. Yuan, and W. Gao, “Minimizing reconstruction bias hashing via joint projection learning and quantization,” IEEE TIP, vol. 27, no. 6, pp. 3127–3141, June 2018.
  • [21] M. Wang, W. Zhou, Q. Tian, and H. Li, “A general framework for linear distance preserving hashing,” IEEE TIP, vol. 27, no. 2, pp. 907–922, Feb 2018.
  • [22] K. Grauman and R. Fergus, "Learning binary hash codes for large-scale image search," Studies in Computational Intelligence, vol. 411, 2013.
  • [23] J. Wang, W. Liu, S. Kumar, and S. Chang, “Learning to hash for indexing big data - a survey,” in Proceedings of the IEEE, 2015.
  • [24] J. Wang, H. T. Shen, J. Song, and J. Ji, "Hashing for similarity search: A survey," 2014.
  • [25] J. Wang, T. Zhang, J. Song, N. Sebe, and H. T. Shen, "A survey on learning to hash," TPAMI, 2017.
  • [26] K. Lin, J. Lu, C.-S. Chen, and J. Zhou, "Learning compact binary descriptors with unsupervised deep neural networks," in CVPR, 2016.
  • [27] J. Song, “Binary generative adversarial networks for image retrieval,” in AAAI, 2018.
  • [28] Z. Wen and W. Yin, “A feasible method for optimization with orthogonality constraints,” Math. Program., Dec 2013.
  • [29] J. Wang, S. Kumar, and S. F. Chang, “Semi-supervised hashing for large-scale search,” TPAMI, 2012.
  • [30] P. H. Schönemann, “A generalized solution of the orthogonal procrustes problem,” Psychometrika, 1966.
  • [31] M. Gurbuzbalaban, A. Ozdaglar, P. A. Parrilo, and N. Vanli, “When cyclic coordinate descent outperforms randomized coordinate descent,” in NIPS, 2017.
  • [32] G. Yuan and B. Ghanem, “An exact penalty method for binary optimization based on mpec formulation,” in AAAI, 2017.
  • [33] S. Boyd and L. Vandenberghe, Convex Optimization.   New York, NY, USA: Cambridge University Press, 2004.
  • [34] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” in Technical report, University of Toronto, 2009.
  • [35] R. Uetz and S. Behnke, “Large-scale object recognition with cuda-accelerated hierarchical neural networks,” in IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS), 2009.
  • [36] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, “Sun database: Exploring a large collection of scene categories,” IJCV, Aug 2016.
  • [37] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “Labelme: A database and web-based tool for image annotation,” IJCV, pp. 157–173, 2008.
  • [38] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, 2014.
  • [39] V. Erin Liong, J. Lu, G. Wang, P. Moulin, and J. Zhou, “Deep hashing for compact binary codes learning,” in CVPR, 2015.
  • [40] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” IJCV, pp. 145–175, 2001.
  • [41] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng, "NUS-WIDE: A real-world web image database from National University of Singapore," in Proc. of ACM Conf. on Image and Video Retrieval (CIVR), 2009.