Introduction
Content-based image retrieval is a fundamental problem in computer vision, media indexing, and data analysis. A common solution assigns each image an indicative feature vector and retrieves similar images by defining a distance metric over the feature vector space.
One of the successful uses of deep learning is data embedding [2, 9], where a network is used to map input data into a feature vector space satisfying some desired distance properties. This technique has many applications, such as word embedding for machine translation [18], face embedding for identity recognition [27, 24, 30, 17], and many more. The main idea behind data embedding is to find a mapping from the input space into a vector space where the distances in the embedding space conform with the desired task. In a typical scenario, the embedding vector is several hundred bytes long (e.g., 512 bytes in the FaceNet [24] embedding), and a new query may be compared to the existing images by nearest-neighbor (NN) search. As the number of images scales up, the memory required to store all the examples becomes too large, and the time complexity of NN search becomes a critical bottleneck.
Many solutions have been proposed to mitigate this issue, including dimensionality reduction [15] and approximate NN search [19]. In recent years, a family of algorithms called Binary Hashing or Hamming Embedding has gained popularity. These algorithms find a mapping from a feature space into a Hamming space using a variety of methods. The main advantages of a binary representation are the significant reductions in storage and in the time required to apply vector comparisons: vectors are compared not in a high-dimensional Euclidean space, but rather in the Hamming space, utilizing the extremely fast XOR operation. This representation is highly valuable in mobile systems, where on-device training is limited by computational constraints. This requires ad hoc hashing methods that can be computed on simple hardware and that generalize well to novel data points.
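The XOR-based comparison mentioned above can be sketched as follows, with binary codes packed into machine integers (names are illustrative, not part of any specific method):

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two binary codes packed into integers:
    XOR leaves a 1 exactly at the positions where the codes differ."""
    return bin(a ^ b).count("1")

# Two 4-bit codes differing in two positions.
d = hamming_distance(0b1010, 0b0110)
```

On modern hardware the popcount of the XOR is a handful of instructions, which is the source of the speedup over Euclidean comparisons.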
Many modern Hamming embedding techniques are data-dependent. Data-dependent methods work by learning an affinity matrix between data points while attempting to preserve their affinities in the Hamming space.
In-sample techniques aim at generating a set of binary codes, a single code for each data point, whose Hamming distances conform with the affinity matrix. Out-of-sample techniques deal with novel samples that are not known in advance: they learn a general functional mapping that maps query points from the feature space into the Hamming space. Affinity between data pairs can be, for example, related to the metric distances between their associated features, or semantic relations indicating data points belonging to the same semantic class. The affinity matrix is usually relaxed to positive values, where small values indicate weak proximity (far pairs) and large values indicate strong proximity (near pairs). This encourages near pairs to be located close by in the Hamming space but does not constrain the far pairs.
We propose a binary hashing framework called Proximity Preserving Code (PPC). The main contribution of our method is that the binary code is constructed based on positive and negative proximity values, representing attractive and repulsive forces. These forces properly arrange the points in the Hamming space while respecting the pairwise affinities. Our solution models this proximity as a signed graph, and the code is computed by finding the min-cut of the graph. Due to the negative values, this problem can be formulated as the max-cut problem and is known to be NP-hard [1]. We demonstrate that our approach is more accurate and memory-efficient than state-of-the-art graph-based embeddings.
Previous Works
Previous works in Hamming embedding can be classified into two distinct categories: data-independent and data-dependent. Data-independent methods are composed of various techniques for dimensionality reduction, or techniques for dividing the N-dimensional space into buckets with equal distributions. One of the most popular data-independent hashing methods is Locality Sensitive Hashing (LSH) [3], a family of hash functions that map similar items to the same hash bucket with higher probability than dissimilar items.
Data-dependent methods learn the distribution of the data in order to create accurate and efficient mapping functions. These functions usually comprise three elements: a hash function, a similarity measure, and an optimization criterion. Hash functions vary and include linear functions [20], nearest-vector assignment [6], kernel functions [11] [14], and more. Similarity measures include the Hamming distance and variants of Euclidean or other compute-intensive distances that are precomputed for vector assignment [7]. Optimization criteria mainly use variants of similarity preservation and code balancing. We will focus on binary hashing methods.

An influential work in binary hashing is Spectral Hashing [29]. This method creates code words that preserve the data similarity. Defining an affinity matrix $W$, the authors cast the hashing problem as the minimization of $\sum_{ij} W_{ij}\,\|y_i - y_j\|^2$, subject to $y_i \in \{-1,1\}^k$, $\sum_i y_i = 0$ (code balancing), and $\frac{1}{n}\sum_i y_i y_i^T = I$ (independence). For a single bit, this minimization can be cast as a graph partitioning problem, which is known to be NP-hard. A good approximation is achieved using spectral methods: the code is obtained by computing the eigenvectors corresponding to the smallest eigenvalues of the graph Laplacian of $W$ and thresholding them at zero.
Liu et al. [16] proposed Anchor Graph Hashing (AGH), a hashing method utilizing a low-rank matrix that approximates the affinity matrix, allowing a graph Laplacian solution that is scalable both for training and for out-of-sample computation. Shen et al. [26] present Inductive Manifold Hashing (IMH), a method that learns a manifold from the data and utilizes it to find a Hamming embedding. They demonstrate their results with several approaches, including Laplacian eigenmaps and t-SNE. Supervised Discrete Hashing (SDH) [25] and Discrete Graph Hashing (DGH) directly optimize the discrete problem, employing discrete coordinate descent to achieve better precision on the graph problem. Scalable Graph Hashing with Feature Transformation [8] uses a feature transformation method to approximate the affinity graph, allowing faster computation on large-scale datasets. Its authors also proposed a sequential approach that learns the code bit-by-bit, allowing for error correction of the previously computed bits. Large Graph Hashing with Spectral Rotation (LGHSR) [13] revisits the spectral solution of the Laplacian graph and proposes a spectral rotation that improves the accuracy of the solutions.
All of the above approaches formulate the graph Laplacian by defining an affinity matrix that takes into account the similarities between points in the training set. However, they do not address the dissimilarities, or push-pull forces, in the data set. In this paper, we propose a binary embedding method that employs an affinity matrix of both positive and negative values. We argue that this type of affinity better represents the relationships between data points, allowing more accurate code generation. The characteristics and advantages of this work are as follows:

Our code is constructed by solving a signed graph-cut problem, to which we propose a highly accurate solution. We demonstrate that the signed graph provides a better encoding for the forces existing in the coding optimization. We show that the commonly used spectral solution, which works well in unsigned graph-cut problems, is unnecessary, costly, and inferior in this scenario.

The code is computed one bit at a time, allowing for error correction during the construction of the hashing functions.

We split the optimization into two steps. We first optimize for a binary vector representing the in-sample data, and then we fit the hashing functions to obtain an accurate code for out-of-sample points.

Our framework is flexible, allowing various proximity definitions, including semantic proximity. This can be useful for many applications, especially in low-computation environments.
Problem Formulation
We are given a set of data points $X = \{x_i\}_{i=1}^{n}$ in some vector space, and a proximity relation defined between pairs of points. We assign each pair of points in $X$ to the Near or Far group, according to some proximity measure. This proximity measure can have a semantic meaning, a geometric meaning, or any other adjacency relation. Formally, we define the set of near pairs $N$ and the set of far pairs $F$. Note that $N$ and $F$ induce a partition of the set of all pairs into two disjoint sets.
In a classification scenario, for example, two points belonging to the same class are defined as Near; otherwise, they are defined as Far. Another example of an adjacency relation is a neighborhood of a certain radius. For a distance metric $d(\cdot,\cdot)$ and a given radius $r$ we define:

$(i,j) \in N$ if $d(x_i, x_j) \le r$, and $(i,j) \in F$ otherwise.  (1)
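As a minimal sketch of the r-neighborhood labeling above (function and variable names are illustrative), the Near/Far partition can be encoded as signed labels:

```python
import math

def proximity_labels(points, radius):
    """Signed proximity labels over all pairs (i, j), i < j:
    +1 for Near pairs (Euclidean distance <= radius), -1 for Far pairs."""
    labels = {}
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            labels[(i, j)] = 1 if d <= radius else -1
    return labels

# Three 2D points: the first two are within radius 1, the third is far away.
pts = [(0.0, 0.0), (0.5, 0.0), (3.0, 4.0)]
labels = proximity_labels(pts, radius=1.0)
```

The same dictionary could just as well be filled from class labels in the supervised setting.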
Denote the $K$-bit binary code of point $x_i$ by $c_i \in \{-1,1\}^K$. Our goal is to find binary codes that satisfy the following two requirements:

Compactness: The length of the code should be short, i.e., $K$ should be as small as possible.

Proximity Preserving: The binary code should preserve the proximity of $X$. That is, there exists a constant $\alpha$ s.t. $d_H(c_i, c_j) < \alpha$ for each pair $(i,j) \in N$, and $d_H(c_i, c_j) > \alpha$ for each $(i,j) \in F$, where $d_H$ stands for the Hamming distance between two binary codes¹: $d_H(c_i, c_j) = \sum_{k=1}^{K} \left(1 - b_i^k b_j^k\right)$, with $b_i^k$ denoting the $k$-th bit of $c_i$.

¹ In fact, this definition is twice the Hamming distance, but we stick with it for the sake of clarity.
It can be shown that if proximity relationships are determined according to the $\ell_1$ or $\ell_2$ distance between points in $X$, the Proximity Preserving requirement can be fully satisfied using long enough codes (i.e., $K$ is large). However, due to the compactness requirement, we relax the proximity preserving requirement and try to find an optimal code for a given code length.
Denote by $l_{ij}$ a proximity label associated with each pair of points $(x_i, x_j)$: $l_{ij} = +1$ if $(i,j) \in N$ and $l_{ij} = -1$ if $(i,j) \in F$. For a given value $\alpha$ we define:

$D_{ij} = \alpha - d_H(c_i, c_j)$  (2)
We would like that $l_{ij} D_{ij} > 0$ for each pair $(i,j)$, and accordingly we define a loss function:

$\ell(i,j) = \mathbb{1}\left[\, l_{ij} D_{ij} < 0 \,\right]$  (3)
The empirical loss for the entire set reads:

$L = \sum_{i<j} \ell(i,j)$  (4)
This loss penalizes pairs of points that are mislabeled, that is, pairs of points in $F$ whose Hamming distance is smaller than $\alpha$, or pairs of points in $N$ whose Hamming distance is larger than $\alpha$.
Definition 1 (Proximity Preserving Code).
Given a set of data points along with their proximity labels, $\{l_{ij}\}$, a Proximity Preserving Code (PPC) of length $K$ is a binary code, $c_i \in \{-1,1\}^K$, $i = 1, \ldots, n$, that minimizes the empirical loss $L$.
In the following we describe the procedure to generate the PPC. In particular, we show that finding a PPC for a given set of points boils down to applying an integer low-rank matrix decomposition. We provide two approximate solutions and show their connection to the minimum signed graph-cut problem. Finally, we provide a solution for extracting hashing functions for out-of-sample data points.
Proximity Preserving Code
Recall the definition of $D_{ij}$ (Equation 2): $D_{ij} = \alpha - d_H(c_i, c_j)$. Substituting the Hamming distance into this expression we get:

$D_{ij} = \alpha - K + \sum_{k=1}^{K} b_i^k b_j^k = \beta + \sum_{k=1}^{K} b_i^k b_j^k$  (5)

where we define $\beta = \alpha - K$.
To simplify notations we define a $K \times n$ code matrix $C$ by stacking the code words along its columns: $C = [c_1, \ldots, c_n]$. Similarly we define $Y = C^T C$, so that $Y_{ij} = c_i^T c_j$. Equation 5 can now be defined over the entries of matrix $Y$: $D_{ij} = \beta + Y_{ij}$, and the total loss (Equation 4) is:

$L = \sum_{i<j} \mathbb{1}\left[\, l_{ij}\,(\beta + Y_{ij}) < 0 \,\right]$  (6)
Denote by $b^k \in \{-1,1\}^n$, $k = 1, \ldots, K$, the rows of $C$ (similarly, the columns of $C^T$). Each $b^k$ is a vector representing the $k$-th bit of all the code words ($n$ words). The matrix $Y$ can now be represented as a linear sum of $K$ matrices:

$Y = \sum_{k=1}^{K} b^k (b^k)^T$  (7)

where $b^k (b^k)^T$ is a rank-1 matrix extracted from the $k$-th bit of the code words. Thus, each additional bit can either increase the rank of matrix $Y$ by one or leave it unchanged. Our goal then is to find a low-rank matrix $Y$ minimizing the loss defined in Equation 6.
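The rank-1 accumulation of $Y$ and its relation to the (doubled) Hamming distance of the footnote can be sketched as follows; the function name and the toy bit vectors are illustrative:

```python
def code_matrix(bits):
    """Accumulate Y = sum_k b^k (b^k)^T from per-bit vectors b^k in {-1,+1}^n.
    With K bits, Y[i][j] is the dot product of code words c_i and c_j, so the
    text's (doubled) Hamming distance is d_H(c_i, c_j) = K - Y[i][j]."""
    n = len(bits[0])
    Y = [[0] * n for _ in range(n)]
    for b in bits:  # each bit vector contributes one rank-1 term
        for i in range(n):
            for j in range(n):
                Y[i][j] += b[i] * b[j]
    return Y

# Two bits over n = 2 points: c_0 = (1, 1), c_1 = (-1, 1).
Y = code_matrix([[1, -1], [1, 1]])
```

Here the codes differ in one bit, so $K - Y_{01} = 2 - 0 = 2$, i.e., twice the plain Hamming distance, consistent with the footnote's convention.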
The minimization of Equation 6 introduces a combinatorial problem which is NP-hard. Therefore we relax the binary loss and redefine it using a logistic loss function: $\tilde{\ell}(i,j) = f\left(l_{ij}\,(\beta + Y_{ij})\right)$, where $f(x) = \log\left(1 + e^{-x}\right)$. The relaxed total loss is therefore

$\tilde{L} = \sum_{i<j} f\left(l_{ij}\,(\beta + Y_{ij})\right)$  (8)
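A direct sketch of this relaxed loss, with illustrative names ($\beta$, the labels $l_{ij}$, and the entries $Y_{ij}$ follow the text):

```python
import math

def relaxed_pair_loss(l_ij: int, beta: float, y_ij: float) -> float:
    """Logistic relaxation of the 0/1 pair loss: log(1 + exp(-l_ij * (beta + Y_ij)))."""
    return math.log1p(math.exp(-l_ij * (beta + y_ij)))

def relaxed_total_loss(labels, beta, Y):
    """Sum of the relaxed loss over all pairs i < j, as in Equation 8.
    `labels` maps pairs (i, j) with i < j to +1 (near) or -1 (far)."""
    n = len(Y)
    return sum(relaxed_pair_loss(labels[(i, j)], beta, Y[i][j])
               for i in range(n) for j in range(i + 1, n))
```

A correctly classified near pair (large $\beta + Y_{ij}$) contributes a near-zero term, while a misclassified one contributes a nearly linear penalty.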
Bit Optimization
In the proposed process we generate the codes for the data points sequentially, bit after bit. In the following we detail the minimization process for the $k$-th bit. This is also illustrated in Figure 0(a).
At this step we assume that the first $k-1$ bits of the PPC code have already been generated. Denote $Y^{(k-1)} = \sum_{t=1}^{k-1} b^t (b^t)^T$. For the $k$-th bit, we minimize Equation 8 with respect to $\beta$ and $b^k$ as follows:

$(\beta, b^k) = \arg\min_{\beta,\, b} \sum_{i<j} f\left(l_{ij}\left(\beta + Y^{(k-1)}_{ij} + b_i b_j\right)\right)$  (9)

Note that $Y^{(k-1)}$ is already known at step $k$. Equation 9 is minimized using alternate minimization, described below.
Step I - optimizing $\beta$:
$\tilde{L}$ is convex with respect to $\beta$, so any scalar search is applicable here. Since the loss is nearly linear for misclassified pairs, a fast yet sufficiently accurate approximation for $\beta$ is to choose the value that equates the number of misclassified pairs in the $N$ and $F$ sets.
For the current code and a constant value $\beta$, define the misclassified near set $M_N(\beta) = \{(i,j) \in N \mid \beta + Y_{ij} < 0\}$ and similarly the misclassified far set $M_F(\beta) = \{(i,j) \in F \mid \beta + Y_{ij} > 0\}$. The value of $\beta$ is set such that the cardinalities of the two sets are equal, i.e., the $\beta$ that satisfies:

$|M_N(\beta)| = |M_F(\beta)|$  (10)

and accordingly $\alpha$ is recovered via $\alpha = \beta + K$.
This is visualized in Step I of Figure 0(a): we show a histogram of the near pairs in blue and the far pairs in red, where $\alpha$ is the vertical black line thresholding the Hamming distance.
Step II - optimizing $b^k$:
For the evaluated $\beta$, Equation 9 becomes:

$b^k = \arg\min_{b} \sum_{i<j} f\left(l_{ij}\left(\beta_{ij} + b_i b_j\right)\right)$

where we define $\beta_{ij} = \beta + Y^{(k-1)}_{ij}$. In a forward greedy selection process, we approximate the potential decrease in the loss using the gradient. Our goal is to find $b$ that minimizes this gradient approximation, or alternatively maximizes $\sum_{i<j} w_{ij}\, b_i b_j$, where:

$w_{ij} = -l_{ij}\, f'\left(l_{ij}\, \beta_{ij}\right)$

where $f'$ stands for the derivative of the logistic loss function $f$. For the sake of simplicity we omit the superscript $k$ and denote $b = b^k$. Collecting the weights into a matrix $W$, s.t. $W_{ij} = w_{ij}$, the above maximization can be simply expressed in matrix form:

$\max_{b \in \{-1,1\}^n} b^T W b$  (11)
If the weight matrix W were entirely positive (all entries positive), this problem could be interpreted as a graph min-cut problem. In our problem, however, the matrix W comprises both positive and negative values, indicating pairs that are properly and improperly assigned as near or far according to the code computed so far. This is termed in the literature a signed min-cut problem, which is equivalent to the max-cut problem, whose solution is NP-hard.
In the proposed solution we start with an initial guess for the bit vector $b$ and improve it using a forward greedy selection scheme. We present two iterative approaches for the selection scheme: vector update and bit update.
Vector Update
Given an initial guess for $b$, the vector update method updates the entire vector at once. At each iteration the vector is improved by applying:

$b' = \mathrm{sign}(W b)$

where $b'$ is the updated vector, which satisfies $b'^T W b' \ge b^T W b$. The following four theorems prove that $b'$ is a better vector than $b$:
Theorem 1.
For any symmetric matrix $W$ and vector $b$, if $b' = \mathrm{sign}(Wb)$, then $b'^T W b \ge b^T W b$.
Theorem 2.
Assuming $W$ is positive semidefinite, if $b'^T W b \ge b^T W b$, then $b'^T W b' \ge b^T W b$.
Theorem 3.
Any symmetric matrix $W$ can be made positive semidefinite by applying $W \leftarrow W + |\lambda_{\min}| I$, where $\lambda_{\min}$ is the smallest eigenvalue of $W$.
Theorem 4.
Adding a constant value to the diagonal of the weight matrix W will not affect the output code computed.
Proofs are given in the Appendix. Using the above theorems, Algorithm 1 summarizes the vector update iterations.
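A minimal sketch of the vector update iteration, relying on the theorems above; as an assumption, the Gershgorin row-sum bound stands in for $|\lambda_{\min}|$ as a cheaper (if looser) diagonal shift, and all names are illustrative:

```python
def vector_update(W, b, iters=50):
    """Iterate b <- sign(W b). With the diagonal shifted so W is PSD
    (Theorems 2-3), each iteration does not decrease b^T W b (Theorem 1),
    and by Theorem 4 the shift does not change the maximizing code."""
    n = len(b)
    # Gershgorin-style bound: max absolute row sum upper-bounds |lambda_min|.
    shift = max(sum(abs(w) for w in row) for row in W)
    Ws = [[W[i][j] + (shift if i == j else 0) for j in range(n)]
          for i in range(n)]
    for _ in range(iters):
        nb = [1 if sum(Ws[i][j] * b[j] for j in range(n)) >= 0 else -1
              for i in range(n)]
        if nb == b:  # fixed point reached: local maximum of b^T W b
            break
        b = nb
    return b

b_opt = vector_update([[0, 1], [1, 0]], [1, -1])
```

For this toy positive-weight matrix, both aligned assignments achieve the maximal objective, and the iteration converges to one of them in a single step.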
Bit Update
Unlike the vector update, the bit update method changes one bit at a time: for each bit in vector $b$, we flip the bit and determine whether the new value improves the objective $b^T W b$. This is repeated for each bit sequentially, and over the entire vector, until convergence. This procedure can be applied very efficiently using the following scheme. Express $b$ over the one-hot basis vectors $e_q$, so that $b_q$ is the value of $b$ at the $q$-th coordinate. When optimizing the $q$-th bit, it can be verified that the only term of $b^T W b$ affected is the one involving $b_q$, and the only elements of $W$ affecting this term lie in the $q$-th column of $W$. Thus,

$b_q = \mathrm{sign}\left( w_q^T b \right)$  (12)

where we denote by $w_q$ the $q$-th column of W. We apply this optimization scheme for each bit sequentially, and repeatedly over the entire vector $b$, until convergence. Each bit update is inserted immediately into $b$ so that the optimization of the next bit accounts for the preceding bits already calculated. This update is computationally inexpensive, requiring $O(n)$ operations per bit update and $O(n^2)$ operations for one round over the entire $b$. This method is summarized in Algorithm 2.
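A sketch of this bit update loop (illustrative names; the diagonal is skipped, which by Theorem 4 does not affect the resulting code):

```python
def bit_update(W, b, max_rounds=100):
    """Greedy single-bit updates: set each b[q] to the sign of the q-th
    column of W times b (Equation 12 style), sweeping until no bit changes."""
    n = len(b)
    for _ in range(max_rounds):
        changed = False
        for q in range(n):
            # O(n) per bit: only the q-th column of W matters here.
            s = sum(W[j][q] * b[j] for j in range(n) if j != q)
            new = 1 if s >= 0 else -1
            if new != b[q]:
                b[q] = new  # inserted immediately, so later bits see it
                changed = True
        if not changed:
            break
    return b

b_opt = bit_update([[0, 1], [1, 0]], [1, -1])
```

Each accepted flip strictly increases $b^T W b$, so the sweep terminates at a local maximum.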
The two algorithms presented for the iterative bit optimization provide a solution to the max-cut problem where both positive and negative weights appear on the graph edges. The iterations require only light computations and stop when a local maximum is reached and the iteration scheme can no longer improve the current bit vector. We show in the Experiments Section that the bit update scheme achieves better codes than the vector update and is therefore preferable.
Initial Guess
Our method is based on an iterative scheme: we start the optimization with an initial guess and improve upon it. A common solution is to relax the constraints and allow real-valued solutions, which enables the maximization problem to be cast as an eigenvalue problem; the final solution is then obtained by thresholding the results, in a manner similar to Spectral Hashing [29]. Interestingly, we have found that starting from a random guess and applying the suggested iterations produces more accurate solutions, with much faster compute time, than the traditional eigenvalue solutions.
In conclusion, the algorithm above can be used to solve the signed graph min-cut problem, where $W$ consists of both positive and negative weights. We show empirically that our solution matches or outperforms other methods by starting from a random guess and applying an update scheme until convergence. Note that the proposed update schemes can improve upon any approximate solution suggested in the literature, as the suggested iterations do not deteriorate, and can only improve, the objective function. Our evaluations and experimental results are provided in the Experiments Section.
Signed Graph Min-Cut Problem
Equation 11 suggests that our problem can be cast as a signed graph min-cut problem. A weighted graph is represented by a vertex set $V$ and weights $w_{ij}$ for each pair of its vertices. The weight of the minimum cut is given by the following problem:

$\min_{x \in \{-1,1\}^n} \frac{1}{2} \sum_{i<j} w_{ij}\,(1 - x_i x_j)$  (13)

where $x$ is an indicator vector s.t. $x_i = 1$ if vertex $i$ belongs to the cut set $S$ and $x_i = -1$ otherwise. The above minimization is an integer quadratic program whose solution is known to be NP-hard [1]. Note that this formulation can equivalently be expressed as $\max_x \sum_{i<j} w_{ij}\, x_i x_j$, which is similar to the expression given in Equation 11. The weights collected in Equation 13 refer only to pairs crossing the cut. Thus, the minimal cut aims at including as many negative weights as possible while excluding positive weights. Since we are dealing with signed graphs, balancing the cut is not as critical as it is in unsigned graphs, since cutting a small component with few edges does not necessarily provide the smallest cut.
Alon and Naor [1] relate the above problem to the cut-norm and provide a semidefinite relaxation that can be solved within a small additive error in polynomial time. They suggested three techniques to round the semidefinite solution into a binary solution, which approximates the original solution up to a constant factor related to Grothendieck's constant. In the Experiments Section we show that our iterative update approach can improve over Alon and Naor's solution when taking their solution as an initial guess. Moreover, taking a random guess as the initial solution provides a final solution that is comparable or better, so the benefit of using a costly approximate solution as the initial guess is questionable.
The minimization in Equation 13 can be equivalently rewritten in a quadratic form:

$\min_{x \in \{-1,1\}^n} x^T L\, x$  (14)

where $L = D - W$ is the Laplacian of the graph and $D$ is a diagonal matrix with $D_{ii} = \sum_j w_{ij}$. The Laplacian of a graph is frequently used for graph clustering or graph-cut via spectral methods. It was shown that taking the second-smallest eigenvector (the Fiedler vector) and thresholding it at zero provides a relaxed approximation for the minimization in Equation 14 (see von Luxburg's tutorial on spectral clustering for more details). However, spectral methods commonly deal with positive weights, where the matrix $L$ is guaranteed to be positive semidefinite. This is not the situation in our case, where eigenvalues might be negative as well.
Kunegis et al. [12] suggested an alternative graph Laplacian for signed graphs, $\bar{L} = \bar{D} - W$ with $\bar{D}_{ii} = \sum_j |w_{ij}|$, and proved that $\bar{L}$ is positive semidefinite. However, Knyazev argues that the signed Laplacian does not give better clustering results than the original definition of the Laplacian, even if the graph is signed. We show in our experiments that neither solution works as well as the greedy update scheme suggested in this paper.
Optimizing the Hashing Functions
Finally, we arrive at the out-of-sample extension and explain how to learn hashing functions that encode out-of-sample data points. We found that it is preferable to first optimize for the binary vectors $b^k$ and then fit hashing functions $h_k$ requiring $h_k(x_i) \approx b_i^k$, for $i = 1, \ldots, n$. Optimizing directly for the hashing functions yields a nonlinear optimization that often provides inaccurate results. Splitting the optimization into two steps allows each step to be exploited in the best manner. We assume that novel data points are drawn from the same distribution as the given data $X$; therefore, the hashing functions can be optimized using the empirical loss over $X$.
We denote by $b^{k*}$ the optimal bit vector resulting from the first step. This vector encodes the optimal binary values for the $k$-th bit over all data points. We then train a binary classifier $h_k$ over the input pairs $\{(x_i, b_i^{k*})\}$ by minimizing a classification loss with respect to the classifier's parameters. We use a kernel SVM [23] with Gaussian kernels to classify the points into $\{-1, 1\}$, but any standard classifier can be applied similarly. At step $k$, we train the hash function $h_k$ and construct the $k$-th bit of the binary codes: $b_i^k = h_k(x_i)$. This bit is updated immediately in the codes, allowing the next bit to account for the errors of $h_k$. This error-correcting scheme is another benefit of the two-step solution.
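To keep the sketch self-contained, the per-bit classifier below is a plain perceptron rather than the kernel SVM used in the paper; it only illustrates the two-step scheme of fitting $h_k$ to the precomputed optimal bit vector $b^{k*}$ (all names and the toy data are illustrative):

```python
def train_bit_classifier(X, y, epochs=100, lr=0.1):
    """Perceptron stand-in for the per-bit hash function h_k: x -> {-1, +1}.
    (The paper uses a kernel SVM; this linear sketch is only an assumption
    made here to keep the example dependency-free.)"""
    d = len(X[0])
    w, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x)) + bias
            pred = 1 if score >= 0 else -1
            if pred != t:  # standard perceptron correction step
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                bias += lr * t
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias >= 0 else -1

# Toy separable data: the target bit b* is the sign of the first coordinate.
X = [(1.0, 0.2), (2.0, -0.5), (-1.5, 0.3), (-0.7, 1.0)]
y = [1, 1, -1, -1]
h = train_bit_classifier(X, y)
```

Any classifier exposing a sign-valued decision function slots into the same place; the error-correcting step simply feeds $h_k(x_i)$ back into the codes before optimizing bit $k+1$.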
As we proceed, the algorithm adds more bits to the PPC code. Each additional bit is aimed at decreasing the total loss. The process terminates when the total loss drops below a given threshold, or when the number of bits exceeds a predefined limit.
Experiments and Results

Table 1: Precision-recall AUC on the CIFAR-10 dataset for varying code lengths.

Code Length |    12    |    16    |    24    |    32    |    48    |    64    |    96    |   128
SH          | 0.121379 | 0.121498 | 0.119628 | 0.121108 | 0.126965 | 0.127710 | 0.130133 | 0.129777
IMH-tSNE    | 0.154169 | 0.165783 | 0.168781 | 0.150094 | 0.165669 | 0.160490 | 0.163685 | 0.174634
IMH-LE      | 0.170701 | 0.164687 | 0.144931 | 0.154949 | 0.165396 | 0.152761 | 0.162143 | 0.152876
SGH         | 0.129056 | 0.127760 | 0.131998 | 0.138006 | 0.136264 | 0.144698 | 0.153421 | 0.157661
LGHSR       | 0.136739 | 0.144933 | 0.149855 | 0.148962 | 0.144178 | 0.144102 | 0.150296 | 0.147022
SDH         | 0.249316 | 0.191575 | 0.229962 | 0.250167 | 0.227537 | 0.257303 | 0.288238 | 0.326233
PPC         | 0.283650 | 0.312302 | 0.308905 | 0.329332 | 0.343186 | 0.352296 | 0.354291 | 0.355555
To illustrate the optimization process, a synthetic example is shown in Figure 2. In this figure, 300 points are drawn in 2D. The figure shows the first 4 bits (red/blue indicate the two bit values), where the proximity matrix was generated with an r-neighborhood proximity measure. This demonstrates how the PPC algorithm separates the vector space into two labels by balancing between areas with high neighbor density and correcting for the errors of previous bits.
The mutual relationships between the actual distances and the Hamming distances are illustrated in Figure 3, which shows the joint histogram of the Hamming distances vs. the Euclidean distances for three cases. This is another synthetic 2D example with varying r-neighborhood proximity measures. The x-axis indicates the Hamming distances of the generated code and the y-axis indicates the actual Euclidean distances. The histograms are plotted as grayscale images, where the gray value of each entry indicates the number of pairs with the associated distances: the brighter the gray value, the greater the number of pairs (we display the log of the actual values for better visualization). For each case we see that most of the pairs with Euclidean distance below the chosen radius (the pairs labeled as "Near" in this example) are concentrated to the left of the respective $\alpha$ value, which was also the final value at the last step of each case. It is interesting to note that the conditional distributions at the two sides of the $\alpha$ values are wide, and the order of the Euclidean distances is not necessarily preserved in the respective Hamming distances. This indicates that the bits are allocated solely to optimize the neighborhood constraints, and not to meet any other requirements such as preserving the ordinal distances. This allows for optimal allocation of the bit resources.
We evaluate Proximity Preserving Code on several public datasets: MNIST [4], CIFAR-10 [10], and LabelMe [22]. CIFAR-10 [10] is a labeled subset of the 80 Million Tiny Images dataset, consisting of 60,000 32×32 color images represented by 512-dimensional GIST feature vectors [21]. It is split into 59,000 training images and 1,000 test images. MNIST [4] is the well-known database of handwritten digits in 28×28 grayscale images. The dataset is split into a training set of 69,000 samples and a test set of 1,000 samples. LabelMe [22] has 20,019 training images and 2,000 test images, each represented by a 512-dimensional GIST descriptor. The descriptors were reduced to 40 dimensions using PCA. We use this dataset as unsupervised, and the affinity is defined by thresholding distances in the Euclidean GIST space such that each training point has an average of 100 neighbors.
We evaluate the results by computing a precision-recall graph over varying Hamming thresholds (denoted by $\alpha$ in Equation 2). We compare our method to the following state-of-the-art spectral hashing methods: Spectral Hashing (SH) [29], Anchor Graph Hashing (AGH) [16], Inductive Manifold Hashing (IMH) [26], Scalable Graph Hashing (SGH) [8], Supervised Discrete Hashing (SDH) [25], and Large Graph Hashing with Spectral Rotation (LGHSR) [13]. We use the default settings provided by the authors, and as in IMH [26] we use the same anchor-number and neighborhood-number settings. For IMH, we show results using both Laplacian eigenmaps (LE) and t-SNE.
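One point of the precision-recall curve at a given Hamming threshold can be sketched as follows (illustrative names; "relevant" means a pair labeled Near):

```python
def precision_recall_at_threshold(dists, labels, alpha):
    """Precision and recall of 'Near' retrieval at Hamming threshold alpha.
    `dists` are pairwise Hamming distances; `labels` are +1 (near) / -1 (far)."""
    tp = sum(1 for d, l in zip(dists, labels) if d <= alpha and l == 1)
    retrieved = sum(1 for d in dists if d <= alpha)
    relevant = sum(1 for l in labels if l == 1)
    precision = tp / retrieved if retrieved else 1.0
    recall = tp / relevant if relevant else 1.0
    return precision, recall

pr = precision_recall_at_threshold([1, 2, 5], [1, -1, 1], alpha=2)
```

Sweeping alpha over all achievable Hamming distances traces out the full curve, and the AUC values of Table 1 summarize that curve with a single number.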
We first compare our method in the unsupervised (or self-supervised) setting to the unsupervised methods listed above. Results are shown in Figure 3(a): the precision-recall graph of the LabelMe dataset with the self-supervised affinity labels. The results are for the 50-bit code computed for the training set vs. the test set. They clearly show that our code is more accurate than the other methods over all Hamming thresholds.
Next, we compare our method in the supervised scenario. Figure 3(b) shows the precision-recall of 50-bit codes for the MNIST dataset. The results are computed for the test set only, showing that our method outperforms the other methods in the more challenging out-of-sample scenario. Similarly, Figure 3(c) shows the comparison for the CIFAR-10 dataset.
To compare performance at different code lengths, we calculate the area under the curve (AUC) of the precision-recall graph in the out-of-sample scenario. Table 1 shows our results on the CIFAR-10 dataset, compared to the results of the spectral methods mentioned above. Our method consistently outperforms the other methods for both short and long codes.
Our solution to the signed min-cut problem includes an iterative scheme that continuously improves the initial guess. As mentioned before, we argue that the initial guess does not play a significant role in the final solution. In fact, at the end of the iterative process, an initial guess based on spectral methods provides results similar to a random initial guess.
In the following experiment, we compute the codes only for the in-sample points (using a fixed random seed) and plot the loss defined in Equation 4 at each code length. We show our results on the benchmark presented in the MLH (Minimal Loss Hashing) paper for six small datasets, each consisting of 1,000 training points. Since we used the full versions of the MNIST and LabelMe datasets in the previous sections, we show here the four remaining datasets. We present the results generated from the following initial guesses: the sign of the eigenvector corresponding to the smallest nontrivial eigenvalue of the Laplacian (L) [28], the signed Laplacian (SL) [12], the sign of a random projection of the 3 eigenvectors corresponding to the smallest nontrivial eigenvalues of the Laplacian [1] (AN), and a random guess (R). We show the effect of improving upon these initial guesses using bit update (BU) as presented in Algorithm 2. Results are shown in Figure 5. They clearly show that the random guess with bit update performs as well as or surpasses the costly spectral computations, while the bit update improves upon all of the initial guesses.
A comparison between the vector update (Algorithm 1) and the bit update (Algorithm 2) for different initial guesses is shown in Figure 6. It is clear that the bit update method outperforms the vector update. This is reasonable, as the bit update optimizes in smaller steps, allowing a broader search for the optimum, whereas the vector update takes large steps and converges quickly to a local optimum.
Conclusions
We have presented a binary hashing method called Proximity Preserving Code (PPC), based on the signed graph-cut problem. We propose an approximation to this problem and show its advantages over other methods suggested in the literature. We also introduce a hashing framework that works for both supervised and unsupervised datasets. The framework computes binary codes that are more accurate than those of state-of-the-art graph hashing algorithms, especially in the challenging out-of-sample scenario. We believe the use of the signed graph problem, instead of relaxation to the standard graph problem, can prove beneficial in other algorithms as well.
Appendix A Appendix
The following four theorems are used in the Vector Update method in Algorithm 1. The proofs are provided below.
Theorem 1.
For any symmetric matrix $W$ and vector $b$, if $b' = \mathrm{sign}(Wb)$, then $b'^T W b \ge b^T W b$.
Proof.
Denote $v = Wb$. Thus $b'^T W b = b'^T v = \sum_i b'_i v_i$. Since $b'_i \in \{-1,1\}$, the expression $\sum_i b'_i v_i$ is maximized when $b'_i = \mathrm{sign}(v_i)$; in particular, it is at least $b^T v = b^T W b$. ∎
Theorem 2.
Assuming $W$ is positive semidefinite (PSD), if $b'^T W b \ge b^T W b$, then $b'^T W b' \ge b^T W b$.
Proof.
Since $W$ is PSD, $(b' - b)^T W (b' - b) \ge 0$, hence

$b'^T W b' + b^T W b \ge 2\, b'^T W b.$

Combining this with the assumption $b'^T W b \ge b^T W b$ yields

$b'^T W b' \ge 2\, b'^T W b - b^T W b \ge b^T W b.$
∎
Theorem 3.
A symmetric matrix $W$ can be made positive semidefinite by applying $W \leftarrow W + |\lambda_{\min}| I$, where $\lambda_{\min}$ is the smallest eigenvalue of $W$.
Proof.
According to the Gershgorin circle theorem [5], for an $n \times n$ matrix W, all eigenvalues lie in at least one of the disks centered at the diagonal entries with radii given by the off-diagonal absolute row sums. Adding $|\lambda_{\min}|$ to the diagonal shifts every eigenvalue by $|\lambda_{\min}|$, so all eigenvalues of the shifted matrix are greater than or equal to zero. ∎
Theorem 4.
Adding a constant value to the diagonal of the weight matrix W will not affect the output code computed.
Proof.
For any $b \in \{-1,1\}^n$, $b^T (W + cI)\, b = b^T W b + c\, n$. The added term $c\, n$ is identical for all candidate codes, so the maximizer of Equation 11 is unchanged. ∎
References
[1] (2004) Approximating the Cut-Norm via Grothendieck's Inequality. In Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, pp. 72–80.
[2] (2005) Learning a Similarity Metric Discriminatively, with Application to Face Verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 539–546.
[3] (2004) Locality-sensitive Hashing Scheme Based on p-stable Distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 253–262.
[4] (2012) The MNIST Database of Handwritten Digit Images for Machine Learning Research. IEEE Signal Processing Magazine 29 (6), pp. 141–142.
[5] (1931) Über die Abgrenzung der Eigenwerte einer Matrix. Izvestija Akademii Nauk SSSR, Serija Matematika 7 (3), pp. 749–754.
[6] (2013) K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2938–2945.
[7] (2011) Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1), pp. 117–128.
[8] (2015) Scalable Graph Hashing with Feature Transformation. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
[9] (2015) Siamese Neural Networks for One-shot Image Recognition. In ICML Deep Learning Workshop, Vol. 2.
[10] (2014) The CIFAR-10 Dataset. Online: http://www.cs.toronto.edu/~kriz/cifar.html.
[11] (2009) Learning to Hash with Binary Reconstructive Embeddings. In Advances in Neural Information Processing Systems, pp. 1042–1050.
[12] (2009) The Slashdot Zoo: Mining a Social Network with Negative Edges. In Proceedings of the 18th International Conference on World Wide Web, pp. 741–750.
[13] (2017) Large Graph Hashing with Spectral Rotation. In Thirty-First AAAI Conference on Artificial Intelligence.
[14] (2015) Deep Learning of Binary Hash Codes for Fast Image Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 27–35.
[15] (2004) Principal Component Analysis. Encyclopedia of Biopharmaceutical Statistics. New York: Marcel Dekker.
[16] (2011) Hashing with Graphs. In Proceedings of the 28th International Conference on Machine Learning, pp. 1–8.
[17] (2017) SphereFace: Deep Hypersphere Embedding for Face Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[18] (2013) Efficient Estimation of Word Representations in Vector Space. In 1st International Conference on Learning Representations.
[19] (2014) Scalable Nearest Neighbor Algorithms for High Dimensional Data. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (11), pp. 2227–2240.
[20] (2011) Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the 28th International Conference on Machine Learning, pp. 353–360.
[21] (2001) Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. International Journal of Computer Vision 42 (3), pp. 145–175.
[22] (2008) LabelMe: A Database and Web-Based Tool for Image Annotation. International Journal of Computer Vision 77 (1–3), pp. 157–173.
[23] (2001) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
[24] (2015) FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823.
[25] (2015) Supervised Discrete Hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 37–45.
[26] (2013) Inductive Hashing on Manifolds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1562–1569.
[27] (2014) DeepFace: Closing the Gap to Human-level Performance in Face Verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1708.
[28] (2007) A Tutorial on Spectral Clustering. Statistics and Computing 17 (4), pp. 395–416.
[29] (2009) Spectral Hashing. In Advances in Neural Information Processing Systems, pp. 1753–1760.
[30] (2016) A Discriminative Feature Learning Approach for Deep Face Recognition. In European Conference on Computer Vision, pp. 499–515.