Hashing has become a popular tool for tackling large-scale tasks in the information retrieval, computer vision and machine learning communities, since it encodes originally high-dimensional data into compact binary codes while maintaining the similarity between neighbors, leading to significant gains in both computation and storage.
Early endeavors in hashing focus on data-independent algorithms, such as locality sensitive hashing (LSH) and min-wise hashing (MinHash), which construct hash functions using random projections or permutations. However, because the hashing is randomized, in practice they usually require long codes to achieve high precision per hash table and multiple tables to boost recall. To learn compact binary codes, data-dependent algorithms that use available training data to learn hash functions have attracted increasing attention. Based on whether semantic label information is utilized, data-dependent algorithms fall into two main groups: unsupervised and supervised. Unlike unsupervised hashing, which explores the intrinsic structure of the data to preserve similarity relations between neighbors without any supervision, supervised hashing employs semantic information to learn hash functions, and thus usually achieves better retrieval accuracy than unsupervised hashing on semantic similarity measures.
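As a minimal sketch of the data-independent approach described above, sign-random-projection LSH draws one random hyperplane per bit; the function name and interface below are illustrative, not from any cited algorithm:

```python
import numpy as np

def lsh_codes(X, n_bits, seed=0):
    """Data-independent LSH sketch: one random hyperplane per bit.

    X is an (n, d) data matrix; returns (n, n_bits) codes in {0, 1}.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_bits))  # random projections
    return (X @ W >= 0).astype(np.uint8)           # threshold at zero
```

Because the projections are random rather than learned, precision per table improves only with longer codes, which is exactly the drawback motivating data-dependent hashing.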
Among supervised hashing algorithms, pairwise based hashing, which maintains the relationships of similar or dissimilar pairs in a Hamming space, is one popular way to exploit label information. Numerous pairwise based algorithms have been proposed in the past decade, including spectral hashing (SH), linear discriminant analysis hashing (LDAHash), minimal loss hashing (MLH), binary reconstruction embedding (BRE) and kernel-based supervised hashing (KSH). Although these algorithms have proven effective on many large-scale tasks, their optimization strategies are usually insufficient to exploit the similarity information defined in the non-convex and non-differentiable objective functions. To handle these non-smooth and non-convex problems, four main strategies have been proposed: symmetric relaxation, asymmetric relaxation, asymmetric discrete optimization and symmetric discrete optimization. Symmetric relaxation relaxes the discrete binary vectors into a continuous feasible region and then thresholds the result to obtain binary codes. Although symmetric relaxation simplifies the original optimization problem, it often produces a large accumulated error between the hash and linear functions. To reduce this accumulated error, asymmetric relaxation approximates the pairwise label matrix with the element-wise product of a discrete matrix and its relaxed continuous counterpart. Asymmetric discrete hashing usually uses the product of two distinct discrete matrices to preserve pairwise relations in the binary space. Symmetric discrete hashing first learns binary codes under symmetric discrete constraints and then trains classifiers on the learned discrete codes. Although most hashing algorithms based on these four strategies achieve promising performance, they suffer from at least one of the following four major disadvantages: (i) binary codes are learned via relaxation and thresholding, producing large accumulated errors; (ii) learning the binary codes requires high storage and computation costs, i.e., , where is the number of data points, thereby limiting their applicability to large-scale tasks; (iii) the pairwise label matrix usually emphasizes the difference between images of different classes but neglects their relevance within the same class, so existing optimization methods may preserve the relevance information among images poorly; (iv) the employed optimization methods each target one type of optimization problem and are difficult to apply directly to other problems.
Motivated by the aforementioned observations, in this paper we propose a novel, simple, general and scalable optimization method that can solve various pairwise based hashing models by directly learning binary codes. The main contributions are summarized as follows:
We propose a novel alternative optimization mechanism that reformulates a typical quartic problem in terms of hash functions, the original objective of KSH, into a linear problem by introducing a linear regression model.
We present and analyze a significant finding: gradually updating the binary codes in a sequential, batch-by-batch mode greatly benefits the convergence of binary code learning.
We propose a scalable symmetric discrete hashing algorithm that gradually updates each batch of the discrete matrix. To make the update step smoother, we further present a greedy symmetric discrete hashing algorithm that greedily updates each bit of the batch discrete matrices. We then demonstrate that the proposed greedy hashing algorithm can be used to solve other optimization problems in pairwise based hashing.
2 Related Work
Based on how the label information is used, supervised hashing can be classified into three major categories: point-wise, multi-wise and pairwise.
Point-wise based hashing formulates retrieval as a classification problem, based on the rule that the classification accuracy obtained with the learned binary codes should be maximized. Supervised discrete hashing (SDH) leverages a linear regression model to generate optimal binary codes. Fast supervised discrete hashing (FSDH) reduces the computational requirements of SDH via a fast approximation. Supervised quantization hashing (SQH) introduces composite quantization into a linear model to further boost the discriminative ability of the binary codes. Deep learning of binary hash codes (DLBHC) and deep supervised convolutional hashing (DSCH) employ convolutional neural networks to simultaneously learn image representations and hash codes in a point-wise manner. Point-wise based hashing is scalable and its optimization problem is easier than those of multi-wise and pairwise based hashing; however, its underlying rule is weaker than those of the other two types of supervised hashing.
Multi-wise based hashing, also known as ranking based hashing, learns hash functions that maximize the agreement of similarity orders over two items between the original and Hamming distances. Triplet ranking hashing (TRH) and column generation hashing (CGH) utilize a triplet ranking loss to maximally preserve the similarity order. Order preserving hashing (OPH) learns hash functions that maximally preserve the similarity order by casting it as a classification problem. Ranking based supervised hashing (RSH) constructs a ranking triplet matrix to maintain the orders of ranking lists. Ranking preserving hashing (RPH) learns hash functions by directly optimizing a ranking measure, Normalized Discounted Cumulative Gain (NDCG). Top rank supervised binary coding (Top-RSBC) focuses on boosting the precision of the top positions in a Hamming distance ranking list. Discrete semantic ranking hashing (DSeRH) learns hash functions that maintain ranking orders while preserving symmetric discrete constraints. Deep network in network hashing (DNNH), deep semantic ranking hashing (DSRH) and triplet-based deep binary embedding (TDBE) utilize convolutional neural networks to learn image representations and hash codes based on a triplet ranking loss over three items. Most of these multi-wise based hashing algorithms relax the ranking order or the discrete binary codes into a continuous feasible region to solve their original non-convex and non-smooth problems.
Pairwise based hashing preserves the relationships among originally high-dimensional data in a Hamming space by computing and preserving the relationship of each pair. SH constructs a graph to maintain the similarity among neighbors and then uses it to map the high-dimensional data into a low-dimensional Hamming space. Although the original SH is unsupervised, it is easily converted into a supervised algorithm. Inspired by SH, many variants have been proposed, including anchor graph hashing, elastic embedding, discrete graph hashing (DGH) and asymmetric discrete graph hashing (ADGH). LDAHash projects high-dimensional descriptor vectors into a low-dimensional Hamming space, maximizing the inter-class distances while minimizing the intra-class distances. MLH adopts structured prediction with latent variables to learn hash functions. BRE minimizes the difference between the Euclidean distances of the original data and their Hamming distances, and leverages a coordinate-descent algorithm to solve the optimization problem while preserving symmetric discrete constraints. SSH introduces a pairwise matrix, and KSH leverages the Hamming distances between pairs to approximate this pairwise matrix. The resulting objective function is intuitive and simple, but the optimization problem is highly non-differentiable and difficult to solve directly. KSH uses a “symmetric relaxation + greedy” strategy to solve it. Two-step hashing (TSH) and FastHash relax the discrete constraints into a continuous region. Kernel based discrete supervised hashing (KSDH) adopts asymmetric relaxation to simultaneously learn the discrete matrix and a low-dimensional projection matrix for the hash functions. Lin:Lin and Lin:V, asymmetric inner-product binary coding (AIBC) and asymmetric discrete graph hashing (ADGH) employ the asymmetric discrete mechanism to learn low-dimensional matrices.
Column sampling based discrete supervised hashing (COSDISH) adopts the same column sampling strategy as latent factor hashing (LFH) but directly learns binary codes by reformulating the binary quadratic programming (BQP) problems as equivalent clustering problems. Convolutional neural network hashing (CNNH) divides the optimization problem into two sub-problems: (i) learning binary codes with a coordinate descent algorithm using Newton directions; (ii) training a convolutional neural network with the learned binary codes as labels. Subsequently, deep hashing network (DHN) and deep supervised pairwise hashing (DSPH) simultaneously learn image representations and binary codes using pairwise labels. HashNet learns binary codes from imbalanced similarity data. Deep cauchy hashing (DCH) utilizes pairwise labels to generate compact and concentrated binary codes for efficient and effective Hamming space retrieval. Unlike these previous works, in this paper we aim to present a simpler, more general and scalable optimization method for binary code learning.
3 Symmetric Discrete Hashing via A Pairwise Matrix
In this paper, matrices and vectors are denoted by boldface uppercase and lowercase letters, respectively. For a matrix , its -th row and -th column vectors are denoted as and , respectively, and is the entry at the -th row and -th column.
KSH is a popular pairwise based hashing algorithm that preserves pairwise relationships by using two identical binary matrices to approximate a real-valued pairwise matrix. Its objective is a quartic optimization problem in terms of the hash functions, and is thus more typical and more difficult to solve than objectives containing only a quadratic term in the hash functions. Therefore, we first propose a novel optimization mechanism to solve the original KSH problem, and then extend the proposed method to solve other pairwise based hashing models.
Given data points , a pair is assigned when the two points are neighbors in a metric space or share at least one common label, and when they are non-neighbors in a metric space or have different class labels. For the single-label multi-class problem, the pairwise matrix is defined as:
For the multi-label multi-class problem, similar to , can be defined as:
where is the relevance between and , defined as the number of common labels shared by and , and is the weight describing the difference between and . In this paper, to preserve the difference between non-neighbor pairs, we empirically set , where is the maximum relevance among all neighbor pairs. We do not set because few data pairs attain that relevance.
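As a concrete NumPy sketch of the multi-label construction above (the non-neighbor weight of -1 below is a placeholder for the empirically chosen constant in the text, and the function name is illustrative):

```python
import numpy as np

def pairwise_matrix(Y, non_neighbor_weight=-1.0):
    """Multi-label pairwise matrix: entry (i, j) is the number of labels
    shared by points i and j for neighbor pairs, and a fixed negative
    weight for non-neighbor pairs (placeholder for the paper's constant).

    Y is an (n, c) binary label matrix.
    """
    R = Y @ Y.T                                   # relevance = common-label counts
    return np.where(R > 0, R, non_neighbor_weight).astype(float)
```

For the single-label case, R reduces to a 0/1 matrix and the same rule recovers the neighbor/non-neighbor definition.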
To encode one data point into -bit hash codes, its -th hash function can be defined as:
where is a projection vector, and if , and otherwise. Note that since can be written in the form by adding one dimension and absorbing , for simplicity we use in this paper. Let be the hash codes of ; then for any pair , we have . To approximate the pairwise matrix , following , a least-squares style objective function is defined as:
where and is a low-dimensional projection matrix. Eq. (4) is a quartic problem in terms of the hash functions, which can be verified by expanding its objective function.
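Concretely, in a NumPy sketch with illustrative names, the codes are signs of linear projections and the least-squares objective compares code inner products against the scaled pairwise matrix; since each entry of the code Gram matrix is quadratic in the codes, the squared residual is quartic in the hash functions:

```python
import numpy as np

def encode(X, W):
    """Codes b = sgn(W^T x) per point, with sgn(0) mapped to +1."""
    return np.where(X @ W >= 0, 1, -1)

def pairwise_loss(B, S, r):
    """KSH-style objective ||r*S - B @ B.T||_F^2 for B in {-1,+1}^{n x r}
    (scaling conventions vary across papers; this is a sketch)."""
    return np.linalg.norm(r * S - B @ B.T) ** 2
```

For r-bit codes the identity b_i . b_j = r - 2 * hamming(b_i, b_j) holds, so fitting the inner products to r*S preserves the Hamming-distance ranking.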
3.2 Symmetric Discrete Hashing
3.2.1 Formulation transformation
In this subsection, we show how to transform Eq. (4) into a linear problem. Since the objective in Eq. (4) is a highly non-differentiable quartic problem in terms of the hash functions , it is difficult to solve directly. We therefore solve Eq. (4) via a novel alternative optimization mechanism: we reformulate the quartic problem in terms of the hash functions as a quadratic one, and then linearize the quadratic problem. The detailed procedure is presented below.
First, we introduce a lemma that illustrates one of our main motivations for transforming the quartic problem into a linear one.
When the matrix satisfies the condition: , it is a global solution of the following problem:
Lemma 1 is easy to verify: when , setting makes the objective in Eq. (5) attain its maximum. Since any satisfying is a global solution of the problem in Eq. (5), the problem in terms of the hash functions can be transformed into a linear problem in terms of . Inspired by this observation, we can solve the quartic problem in terms of the hash functions. For brevity, in the following we first ignore the constraint in Eq. (4), transform the quartic problem in terms of into the linear form of the objective in Eq. (5), and then obtain the low-dimensional projection matrix .
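Lemma 1 amounts to the fact that the entry-wise sign maximizes a linear objective over the binary cube, since each entry can be chosen independently; a quick NumPy check with illustrative names:

```python
import numpy as np

def sign_maximizer(A):
    """B = sgn(A) maximizes tr(B^T A) = sum_ij B_ij * A_ij over all
    B with entries in {-1, +1}: each entry is chosen independently so
    that B_ij * A_ij = |A_ij|."""
    return np.where(A >= 0, 1, -1)
```

A brute-force enumeration over all sign matrices of a small instance confirms the closed-form maximizer.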
To reformulate the quartic problem in terms of as a quadratic one, in the -th iteration we fix one discrete matrix to be and solve the following quadratic problem in terms of :
Note that the problem in Eq. (6) is not strictly equivalent to the problem in Eq. (4) w.r.t. . However, when , the solution is optimal for both Eq. (6) and Eq. (4) w.r.t. . The details are given in Proposition 1.
Obviously, if , it is the optimal solution of Eq. (6). Then we can consider the following formulation:
Similar to one major motivation of asymmetric discrete hashing algorithms, in Eq. (7) the feasible region of in the left term is more flexible than that in the right term (Eq. (4)); i.e., the left term covers both cases and . Only when does . This implies that when , it is the optimal solution of Eq. (4). Therefore, when , it is the optimal solution of both Eq. (4) and Eq. (6). ∎
Since is a linear problem in terms of , the main difficulty in solving Eq. (8) comes from the non-convex quadratic term . We therefore linearize this quadratic term in terms of by introducing the following linear regression model:
Given a discrete matrix and one real nonzero matrix , , where is a diagonal matrix and is an identity matrix.
It is easy to verify that is the global optimal solution of the problem . Substituting into the objective, its minimum value is . Theorem 1 is thus proved. ∎
Theorem 1 suggests that when , the quadratic problem in Eq. (8) can be linearized into a regression form. The details are given in Theorem 2.
When , where is a constant, the problem in Eq. (8) can be reformulated as:
Since , the problem in Eq. (9) is equivalent to:
which is a linear problem in terms of .
Next, we demonstrate that there exist and such that . The details are given in Theorem 3.
Suppose that a full rank matrix , where is a positive diagonal matrix and . If and , and a real nonzero matrix satisfies the conditions: , and is a non-negative real diagonal matrix with the -th diagonal element being .
Based on the singular value decomposition (SVD), there exist matrices and , satisfying the conditions and , such that the real nonzero matrix can be represented as , where is a non-negative real diagonal matrix. Then . Note that the vectors in and corresponding to the zero diagonal elements can be constructed by a Gram-Schmidt process such that and , and these constructed vectors are not unique.
Since and , we have when . Since there exists , must satisfy and . Additionally, based on , there exists . Theorem 3 is thus proved. ∎
is usually a positive-definite matrix thanks to , leading to . Based on Theorem 3, it is easy to construct a real nonzero matrix . Since , for simplicity we set , where and is a constant. Then Eq. (10) can be solved by alternately updating , and . In fact, can be obtained with an efficient algorithm given in Theorem 4 that avoids computing the matrices and .
For the inner -th iteration embedded in the outer -th iteration, the problem in Eq. (10) can be reformulated as:
where denotes binary codes in the outer -th iteration, and represents the obtained binary codes at the inner -th iteration embedded in the outer -th iteration.
For the inner loop embedded in the outer -th iteration, there are many choices for the initialization ; here we set . At the -th iteration, the global solution of Eq. (11) is . Moreover, since both and are global solutions at the -th iteration of the inner loop, the objective of Eq. (11) is non-decreasing and converges to at least a local optimum. We therefore have the following theorem.
For the inner loop embedded in the outer -th iteration, the objective of Eq. (11) is monotonically non-decreasing in each iteration and converges to at least a local optimum.
Although Theorem 5 guarantees that the objective of Eq. (11) converges, the convergence is strongly affected by the parameter , which balances convergence against the semantic information in . Usually, the larger , the faster the convergence but the greater the loss of semantic information. We therefore empirically set to a small non-negative constant, i.e. , where .
Based on Eq. (11), the optimal solution can be obtained. We then use to replace in Eq. (6) in the next iteration to obtain the optimal solution . Since and , by Lemma 1 should satisfy . However, this is an overdetermined linear system because . For simplicity, we obtain with a least-squares model, which gives .
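Since the overdetermined sign system generally has no exact solution, the projection is fit by least squares; a minimal NumPy sketch (the small ridge term and the names are my additions, for numerical stability rather than from the paper):

```python
import numpy as np

def fit_projection(X, B, reg=1e-6):
    """Least-squares projection W minimizing ||X @ W - B||_F^2, with a
    small ridge term so that X.T @ X is safely invertible.

    X: (n, d) data, B: (n, r) target codes in {-1, +1}."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ B)
```

When the codes were actually generated by some projection, the least-squares fit recovers hash functions whose signs agree with the target codes on most points.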
3.2.2 Scalable symmetric discrete hashing with updating batch binary codes
Remark: The optimal solution of Eq. (6) is at least a local optimal solution of Eq. (4) only when . Given an initialization , can be updated alternately by solving Eq. (11). However, when all binary codes are updated at once over the non-convex feasible region, may oscillate between two different discrete matrices, which leads to the error (see Figure 1a), and the objective of Eq. (4) worsens (see Figure 1b). We therefore divide into batches and gradually update them in a sequential, batch-by-batch mode.
To update one batch of , i.e. , where is a column vector holding the indices of the selected binary codes in , the optimization problem derived from Eq. (6) is:
Furthermore, although is high-dimensional for large , it is low-rank or can be approximated by a low-rank matrix. Similar to previous algorithms, we can select () samples from the training samples as anchors and construct an anchor-based pairwise matrix , which preserves almost all of the similarity information in . Let denote the binary codes of the anchors; replacing with when updating , Eq. (13) becomes:
where denotes obtained at the -th iteration in the outer loop, and .
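The batch-by-batch scheme can be sketched as follows in NumPy; this is a schematic of the sequential update order only, with a sign-of-linear-term step standing in for the paper's exact per-batch subproblem:

```python
import numpy as np

def update_codes_batchwise(B, S, r, batch_size=64):
    """Sequentially refit each batch of rows of B by the sign of its
    linear term r * S[idx] @ B, holding all other codes fixed. Later
    batches see the batches already updated in this sweep, which is the
    sequential behavior the text argues aids convergence. Illustrative
    sketch, not the exact update rule of Algorithm 1."""
    B = B.copy()
    for start in range(0, B.shape[0], batch_size):
        idx = slice(start, start + batch_size)
        G = S[idx] @ B * r                      # linear term for this batch
        B[idx] = np.where(G >= 0, 1, -1)
    return B
```

Each sweep touches every code once, so a full pass costs one pass over the (anchor-based) pairwise matrix rather than repeated whole-matrix updates.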
Algorithm 1: SDH_P
Input: Data , pairwise matrix , bit number , parameters , , batch size , anchor index , outer and inner loop maximum iteration numbers , .
Output: and .
1: Initialize: Let , set to be the left-eigenvectors of corresponding to its largest eigenvalues, calculate .
2: while not converged and maximum iterations not reached
4:   for to do
6:     Do the SVD of ;
10:    until convergence
12:  end for
13: end while
14: Do the SVD of ;
Algorithm 2: GSDH_P
Input: Data , pairwise matrix , bit number , parameters , , batch size , anchor index , outer/inner maximum iteration numbers , .
Output: and .
1: Initialize: Let , set to be the left-eigenvectors of corresponding to its largest eigenvalues, calculate ,
2: while not converged and maximum iterations not reached
4:   for to do
6:     for to do
7:       Calculate with fixed,
10:      until convergence
12:    end for
13:  end for
14: end while
15: Do the SVD of ;
where denotes the batch binary codes at the -th iteration in the outer loop.
For clarity, the detailed optimization procedure that obtains by updating each batch and computes the projection matrix is presented in Algorithm 1, termed symmetric discrete hashing via a pairwise matrix (SDH_P). In Algorithm 1, with each batch of updated gradually, the error usually converges to zero (see Figure 1a) and the objective of Eq. (4) converges to a better local optimum (see Figure 1b). We also report the retrieval performance in terms of mean average precision (MAP) for a small batch size and different numbers of iterations in Figure 1c.
3.3 Greedy Symmetric Discrete Hashing
To make the update step smoother, we greedily update each bit of the batch matrix . Let be the -th bit of ; it can be updated by solving the following optimization problem:
where and represent the -th and -th bits of , respectively, and is the -th bit of .
where is obtained at the -th iteration of the outer loop, is obtained at the -th iteration of the inner loop embedded in the -th outer loop, and .
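The greedy bit-wise refinement can be sketched as follows; this is an illustrative coordinate heuristic (refit one bit column against the residual of the pairwise fit with the other columns fixed), standing in for the exact per-bit subproblem of GSDH_P:

```python
import numpy as np

def greedy_bit_update(B, S, r):
    """Cycle over bit columns of B in {-1,+1}^{n x r}. For bit k the
    other r-1 columns are fixed, leaving the residual
    r*S - B_rest @ B_rest.T to be explained by the rank-one term
    b_k @ b_k.T; the column is refit by a sign step on that residual.
    A sketch of the greedy scheme, not the paper's exact rule."""
    B = B.copy()
    n, r_bits = B.shape
    for k in range(r_bits):
        rest = [c for c in range(r_bits) if c != k]
        residual = r * S - B[:, rest] @ B[:, rest].T
        B[:, k] = np.where(residual @ B[:, k] >= 0, 1, -1)
    return B
```

Updating one bit at a time makes each step a small change to the code matrix, which is why the text describes the greedy variant as a smoother update.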
In summary, the detailed optimization procedure is given in Algorithm 2, termed greedy symmetric discrete hashing via a pairwise matrix (GSDH_P). The error and the objective of Eq. (4) for Algorithm 2, and its retrieval performance in terms of MAP for different numbers of iterations, are shown in Figures 1a, 1b and 1c, respectively.
Out-of-sample: In the query stage, is employed as the binary codes of the training data. We adopt two strategies to encode a query data point : (i) encoding it using ; (ii) similar to previous algorithms, employing as labels to learn a classification model, such as least squares, decision trees (DT) or convolutional neural networks (CNNs), and classifying the query with it.
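Strategy (i) is a single matrix-sign step with the learned projection; strategy (ii) treats each learned bit as a binary label for any off-the-shelf classifier. A sketch of strategy (i), with illustrative names:

```python
import numpy as np

def encode_query(x_query, W):
    """Strategy (i): hash queries directly with the learned projection,
    using the same sgn(x @ W) rule as for the training data. Accepts a
    single query vector or a batch of queries."""
    return np.where(np.atleast_2d(x_query) @ W >= 0, 1, -1)
```

Strategy (ii) simply fits one binary classifier per bit on the training features with the learned code bits as targets, and stacks the per-bit predictions into the query code.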
3.4 Convergence Analysis
Empirically, when , the proposed algorithms converge to at least a local optimum, although convergence cannot be theoretically guaranteed in all cases. Here, we explain why gradually updating each batch of binary codes benefits the convergence of hash code learning.
The hash code matrix can be represented as , where . Since , the objective of Eq. (18) is determined by:
Because of , Eq. (20) usually leads to
3.5 Time Complexity Analysis
In Algorithm 1, and . In step 1, calculating the matrices and requires and operations, respectively. In the outer loop, the time complexities of steps 6, 7, 9 and 11 are , , and , respectively; hence the outer loop costs operations. Steps 14-16, which calculate the projection matrix, cost , and , respectively. Therefore, the total complexity of Algorithm 1 is . Empirically, and .
For Algorithm 2, step 1, which calculates , and , costs at most . In the loop, the major steps 7 and 9 each require operations; hence the total time complexity of the loop is . Calculating the final costs the same as steps 14-16 of Algorithm 1. Therefore, the time complexity of Algorithm 2 is .
4 Extension to Other Hashing Algorithms
In this section, we illustrate that the proposed GSDH_P algorithm can solve many other pairwise based hashing models.
Several algorithms iteratively update each bit using different loss functions defined on the Hamming distances of data pairs, so that the loss functions of many hashing algorithms, such as BRE, MLH and EE, are incorporated into a general framework, which can be written as:
where represents the -th bit of the binary codes and is obtained from the different loss functions by fixing all bits of the binary codes except .
These algorithms first relax into and then employ L-BFGS-B to solve the relaxed optimization problem, followed by thresholding to obtain the binary vector . In contrast, our optimization mechanism can directly solve Eq. (22) without relaxing .
Since and , the problem in Eq. (22) can be equivalently reformulated as:
Since incurs large computation and storage costs for large , we select training anchors to construct based on the different loss functions. Replacing in Eq. (24) with , it becomes:
Similar to solving Eq. (4), we can first obtain and then calculate . To obtain , we still gradually update each batch by solving the following problem: