NIPS2016
This project collects the different accepted papers and their link to Arxiv or Gitxiv
Binary hashing is a well-known approach for fast approximate nearest-neighbor search in information retrieval. Much work has focused on affinity-based objective functions involving the hash functions or binary codes. These objective functions encode neighborhood information between data points and are often inspired by manifold learning algorithms. They ensure that the hash functions differ from each other through constraints or penalty terms that encourage codes to be orthogonal or dissimilar across bits, but this couples the binary variables and complicates the already difficult optimization. We propose a much simpler approach: we train each hash function (or bit) independently from each other, but introduce diversity among them using techniques from classifier ensembles. Surprisingly, we find that not only is this faster and trivially parallelizable, but it also improves over the more complex, coupled objective function, and achieves state-of-the-art precision and recall in experiments with image retrieval.
Information retrieval tasks such as searching for a query image or document in a database are essentially a nearest-neighbor search (Shakhnarovich et al., 2006). When the dimensionality of the query and the size of the database are large, approximate search is necessary. We focus on binary hashing (Grauman and Fergus, 2013), where the query and database are mapped onto low-dimensional binary vectors, in which the search is performed. This yields two speedups: computing Hamming distances (with hardware support) is much faster than computing distances between high-dimensional floating-point vectors; and the entire database becomes much smaller, so it may reside in fast memory rather than disk (for example, a database of 1 billion real vectors of dimension 500 takes 2 TB in floating point but 8 GB as 64-bit codes).
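The two speedups can be checked with a small sketch. It assumes 4-byte (single-precision) floats, which is what makes the 2 TB figure come out, and stores each code as a Python integer; the function names are illustrative only.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two binary codes stored as integers."""
    # XOR leaves a 1 exactly where the two codes disagree; then count the ones.
    return bin(code_a ^ code_b).count("1")


def storage_bytes(num_points: int, bytes_per_point: int) -> int:
    """Total storage for a database with a fixed number of bytes per point."""
    return num_points * bytes_per_point


num_points = 10**9
# Assumption: 500 dimensions stored as 4-byte floats.
float_bytes = storage_bytes(num_points, 500 * 4)
# One 64-bit code (8 bytes) per point.
code_bytes = storage_bytes(num_points, 64 // 8)

print(float_bytes / 1e12, "TB as floating-point vectors")
print(code_bytes / 1e9, "GB as 64-bit codes")
print(hamming_distance(0b1011, 0b0010), "differing bits")
```

On hardware, the XOR and bit count map to single instructions per machine word, which is where the distance speedup comes from.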
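Constructing hash functions that do well in retrieval measures such as precision and recall is usually done by optimizing an affinity-based objective function that relates Hamming distances to supervised neighborhood information in a training set. Many such objective functions have the form of a sum of pairwise terms that indicate whether the training points $\mathbf{x}_n$ and $\mathbf{x}_m$ are neighbors:
$$\min_{\mathbf{h}} \mathcal{L}(\mathbf{h}) = \sum_{n,m=1}^{N} L(\mathbf{h}(\mathbf{x}_n), \mathbf{h}(\mathbf{x}_m); y_{nm}).$$
Here, $\mathbf{x}_1,\dots,\mathbf{x}_N \in \mathbb{R}^D$ is the dataset of high-dimensional feature vectors (e.g., SIFT features of an image), $\mathbf{h}\colon \mathbb{R}^D \to \{-1,+1\}^b$ are $b$ binary hash functions and $\mathbf{z} = \mathbf{h}(\mathbf{x})$ is the $b$-bit code vector for input $\mathbf{x}$, $\min_{\mathbf{h}}$ means minimizing over the parameters of the hash function $\mathbf{h}$ (e.g. over the weights of a linear SVM), and $L(\cdot)$ is a loss function that compares the codes for two images (often through their Hamming distance $\|\mathbf{h}(\mathbf{x}_n) - \mathbf{h}(\mathbf{x}_m)\|$) with the ground-truth value $y_{nm}$ that measures the affinity in the original space between the two images $\mathbf{x}_n$ and $\mathbf{x}_m$ (distance, similarity or other measure of neighborhood). The sum is often restricted to a subset of image pairs $(n,m)$ (for example, within the nearest neighbors of each other in the original space), to keep the runtime low. The output of the algorithm is the hash function $\mathbf{h}$ and the binary codes $\mathbf{z}_1,\dots,\mathbf{z}_N$ for the training points, where $\mathbf{z}_n = \mathbf{h}(\mathbf{x}_n)$ for $n = 1,\dots,N$. Examples of these objective functions are Supervised Hashing with Kernels (KSH) (Liu et al., 2012), Binary Reconstructive Embeddings (BRE) (Kulis and Darrell, 2009) and the binary Laplacian loss (an extension of the Laplacian Eigenmaps objective; Belkin and Niyogi, 2003):
$$\text{KSH:}\quad L(\mathbf{z}_n,\mathbf{z}_m;y_{nm}) = (\mathbf{z}_n^T\mathbf{z}_m - b\,y_{nm})^2 \tag{1}$$
$$\text{BRE:}\quad L(\mathbf{z}_n,\mathbf{z}_m;y_{nm}) = \big(\tfrac{1}{b}\|\mathbf{z}_n-\mathbf{z}_m\|^2 - y_{nm}\big)^2 \tag{2}$$
$$\text{LAP:}\quad L(\mathbf{z}_n,\mathbf{z}_m;y_{nm}) = y_{nm}\,\|\mathbf{z}_n-\mathbf{z}_m\|^2 \tag{3}$$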
where for KSH $y_{nm}$ is $1$ if $\mathbf{x}_n$, $\mathbf{x}_m$ are similar and $-1$ if they are dissimilar; for BRE $y_{nm} = \frac{1}{2}\|\mathbf{x}_n - \mathbf{x}_m\|^2$ (where the dataset is scaled or normalized so the Euclidean distances are in $[0,1]$); and for the Laplacian loss $y_{nm} > 0$ if $\mathbf{x}_n$, $\mathbf{x}_m$ are similar and $y_{nm} < 0$ if they are dissimilar (“positive” and “negative” neighbors). Other examples of these objective functions include models developed for dimension reduction, be they spectral such as Locally Linear Embedding (Roweis and Saul, 2000) or Anchor Graphs (Liu et al., 2011), or nonlinear such as the Elastic Embedding (Carreira-Perpiñán, 2010) or t-SNE (van der Maaten and Hinton, 2008); as well as objective functions designed specifically for binary hashing, such as Semi-supervised sequential Projection Learning Hashing (SPLH) (Wang et al., 2012). They all can produce good hash functions. We will focus on the Laplacian loss in this paper.
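As an illustration, the three pairwise losses can be sketched as follows. The forms assumed here are the usual ones for codes in $\{-1,+1\}^b$: KSH as $(\mathbf{z}_n^T\mathbf{z}_m - b\,y_{nm})^2$, BRE as $(\frac{1}{b}\|\mathbf{z}_n-\mathbf{z}_m\|^2 - y_{nm})^2$ and Laplacian as $y_{nm}\|\mathbf{z}_n-\mathbf{z}_m\|^2$; the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length code vectors."""
    return sum(a * b for a, b in zip(u, v))


def squared_distance(u, v):
    """Squared Euclidean distance; for +/-1 codes this is 4x the Hamming distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def ksh_loss(z_n, z_m, y_nm):
    """Assumed KSH form: (z_n . z_m - b * y_nm)^2 with y_nm in {+1, -1}."""
    return (dot(z_n, z_m) - len(z_n) * y_nm) ** 2


def bre_loss(z_n, z_m, y_nm):
    """Assumed BRE form: (||z_n - z_m||^2 / b - y_nm)^2 with y_nm in [0, 1]."""
    return (squared_distance(z_n, z_m) / len(z_n) - y_nm) ** 2


def laplacian_loss(z_n, z_m, y_nm):
    """Assumed Laplacian form: y_nm * ||z_n - z_m||^2 (y_nm > 0 if similar)."""
    return y_nm * squared_distance(z_n, z_m)


z_n = [1, -1, 1, 1]
z_m = [1, 1, -1, 1]
print(ksh_loss(z_n, z_m, 1), bre_loss(z_n, z_m, 0.5), laplacian_loss(z_n, z_m, -1))
```

Note that only the Laplacian loss is linear in the squared distance; the squaring in the other two is what couples the bits when $b > 1$.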
In designing these objective functions, one needs to eliminate two types of trivial solutions. 1) In the Laplacian loss, mapping all points to the same code, i.e., $\mathbf{z}_1 = \dots = \mathbf{z}_N$, is the global optimum of the positive neighbors term (this also arises if the codes are real-valued, as in Laplacian eigenmaps). This can be avoided by having negative neighbors. 2) Having all hash functions (all $b$ bits of each vector) identical to each other, i.e., $z_{n1} = \dots = z_{nb}$ for each $n = 1,\dots,N$. This can be avoided by introducing constraints, penalty terms or other mathematical devices that couple the $b$ bit dimensions. For example, in the Laplacian loss (3) we can encourage codes to be orthogonal through a constraint $\mathbf{Z}^T\mathbf{Z} = N\mathbf{I}$, where $\mathbf{Z}$ is the $N \times b$ matrix of codes (Weiss et al., 2009), or a penalty term $\|\mathbf{Z}^T\mathbf{Z} - N\mathbf{I}\|^2$ (the latter requiring a hyperparameter that controls the weight of the penalty) (Ge et al., 2014), although this generates dense matrices of $N \times N$. In the KSH or BRE losses (1)–(2), squaring the dot product or Hamming distance between the codes couples the $b$ bits.
An important downside of these approaches is the difficulty of their optimization. This is because the objective function is nonsmooth (implicitly discrete) due to the binary output of the hash function. There is a large number of such binary variables ($bN$), a larger number of pairwise interactions ($\mathcal{O}(N^2)$, fewer if using sparse neighborhoods) and the variables are coupled by the said constraints or penalty terms. The optimization is approximated in different ways. Most papers ignore the binary nature of the codes and optimize over them as real values, then binarize them by truncation (possibly with an optimal rotation; Yu and Shi, 2003; Gong et al., 2013), and finally fit a classifier (e.g. linear SVM) to each of the $b$ bits separately. For example, for the Laplacian loss with constraints this involves solving an eigenproblem as in Laplacian eigenmaps (Belkin and Niyogi, 2003; Weiss et al., 2009; Zhang et al., 2010), or approximated using landmarks (Liu et al., 2011). This is fast, but relaxing the codes in the optimization is generally far from optimal. Some recent papers try to respect the binary nature of the codes during their optimization, using techniques such as alternating optimization, min-cut and GraphCut (Boykov et al., 2001; Lin et al., 2014b; Ge et al., 2014) or others (Lin et al., 2013), and then fit the classifiers, or use alternating optimization directly on the hash function parameters (Liu et al., 2012). More recent work optimizes jointly over the binary codes and hash functions (Ge et al., 2014; Carreira-Perpiñán and Raziperchikolaei, 2015; Raziperchikolaei and Carreira-Perpiñán, 2015). Most of these approaches are slow and limited to small datasets (a few thousand points) because of the quadratic number of pairwise terms in the objective.
We propose a different, much simpler approach. Rather than coupling the hash functions into a single objective function, we train each hash function independently of the others, using a single-bit objective function of the same form. We show that we can avoid trivial solutions by injecting diversity into each hash function’s training using techniques inspired by classifier ensemble learning. Section 2 discusses relevant ideas from the ensemble learning literature, section 3 describes our independent Laplacian hashing (ILH) algorithm, section 4 gives evidence with image retrieval datasets that this simple approach indeed works very well, and section 5 further discusses the connection between hashing and ensembles.
At first sight, optimizing (3) without constraints does not seem like a good idea: since the objective separates over the $b$ bits, we obtain $b$ independent identical objectives, one over each hash function, and so they all have the same global optimum. And, if all hash functions are equal, they are equivalent to using just one of them, which will give a much lower precision/recall. In fact, the very same issue arises when training an ensemble of classifiers (Dietterich, 2000; Zhou, 2012; Kuncheva, 2014). Here, we have a training set of input vectors and output class labels, and want to train several classifiers whose outputs are then combined (usually by majority vote). If the classifiers are all equal, we gain nothing over a single classifier. Hence, it is necessary to introduce diversity among the classifiers so that they disagree in their predictions. The ensemble learning literature has identified several mechanisms to inject diversity. The most important ones that apply to our binary hashing setting are as follows:
Using different data for each classifier. This can be done by: 1) Using different feature subsets for each classifier. This works best if the features are somewhat redundant. 2) Using different training sets for each classifier. This works best for unstable algorithms (whose resulting classifier is sensitive to small changes in the training data), such as decision trees or neural nets, but not for stable ones such as linear or nearest-neighbor classifiers. A prominent example is bagging (Breiman, 1996), which generates bootstrap datasets and trains a model on each.
Injecting randomness into the training algorithm. This is only possible if local optima exist (as for neural nets) or if the algorithm is randomized (as for decision trees). This can be done by using different initializations, adding noise to the updates or using different choices in the randomized operations in the algorithm (e.g. the choice of split in decision trees, as in random forests; Breiman, 2001).
Using different classifier models. For example, different parameters (e.g. the number of neighbors in a nearest-neighbor classifier), different architectures (e.g. neural nets with different number of layers or hidden units), or different types of classifiers altogether.
There are other variations in addition to these techniques, as well as combinations of them.
The connection of binary hashing with ensemble learning offers many possible options, in terms of the choice of type of hash function (“base learner”), binary hashing (single-bit) objective function, optimization algorithm, and diversity mechanism. In this paper we focus on the following choices. We use linear and kernel SVMs as hash functions. Without loss of generality (see later), we use the Laplacian objective (3), which for a single bit takes the form
$$E(\mathbf{z}) = \sum_{n,m=1}^{N} y_{nm}\,(z_n - z_m)^2, \qquad \mathbf{z} = (z_1,\dots,z_N)^T \in \{-1,+1\}^N. \tag{4}$$
To optimize it, we use a two-step approach, where we first optimize (4) over the $N$ bits and then learn the hash function by fitting to it a binary classifier. (It is also possible to optimize over the hash function directly with the method of auxiliary coordinates; Carreira-Perpiñán and Raziperchikolaei, 2015; Raziperchikolaei and Carreira-Perpiñán, 2015, which essentially iterates over optimizing (4) and fitting the classifier.) The Laplacian objective (4) is NP-complete if we have negative neighbors (i.e., some $y_{nm} < 0$). We approximately optimize it using a min-cut algorithm (as implemented by Boykov et al., 2001) applied in alternating fashion to submodular blocks as described in Lin et al. (2014a). This first partitions the $N$ points into disjoint groups containing only nonnegative weights. Each group defines a submodular function (specifically, quadratic with nonpositive coefficients) whose global minimum can be found in polynomial time using min-cut. The order in which the groups are optimized over is randomized at each iteration (this improves over using a fixed order). The approximate optimizer found depends on the initial $\mathbf{z}$.
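To make the single-bit problem concrete, the sketch below evaluates a Laplacian energy of the form $\sum_{n,m} y_{nm}(z_n - z_m)^2$ and lowers it with a simple greedy bit-flip local search. This local search is a stand-in chosen for brevity and is not the alternating min-cut procedure described above; the names `laplacian_energy` and `flip_local_search` and the toy affinity matrix are hypothetical.

```python
def laplacian_energy(z, Y):
    """Single-bit Laplacian energy: sum over all ordered pairs of y_nm * (z_n - z_m)^2."""
    n = len(z)
    return sum(Y[i][j] * (z[i] - z[j]) ** 2 for i in range(n) for j in range(n))


def flip_local_search(Y, z_init, max_sweeps=100):
    """Greedy descent: flip one bit at a time, keeping a flip only if it lowers the energy.

    Stand-in optimizer (not min-cut); like it, the result depends on z_init.
    """
    z = list(z_init)
    best = laplacian_energy(z, Y)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(z)):
            z[i] = -z[i]                      # tentative flip
            energy = laplacian_energy(z, Y)
            if energy < best:
                best, improved = energy, True
            else:
                z[i] = -z[i]                  # undo the flip
        if not improved:
            break
    return z, best


# Toy affinities: points 0,1 are similar, points 2,3 are similar,
# and every cross pair is dissimilar (negative neighbors).
Y = [[0, 1, -1, -1],
     [1, 0, -1, -1],
     [-1, -1, 0, 1],
     [-1, -1, 1, 0]]
codes, energy = flip_local_search(Y, [1, 1, 1, 1])
print(codes, energy)
```

Starting from the trivial all-ones code, the search separates the two groups of similar points into opposite bit values, which is the behavior the negative neighbors are meant to induce.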
Finally, we consider three types of diversity mechanism (as well as their combination):
Different initializations (ILHi): Each hash function is initialized from a random $N$-bit vector $\mathbf{z}$.
Different training sets (ILHt): Each hash function uses a training set of $N$ points that is different and (if possible) disjoint from that of other hash functions. We can afford to do this because in binary hashing the training sets are potentially very large, and the computational cost of the optimization limits the training sets to a few thousand points. Later we show this outperforms using bootstrapped training sets.
Different feature subsets (ILHf): Each hash function is trained on a random subset of $d \le D$ features sampled without replacement (so the features are distinct). The subsets corresponding to different hash functions may overlap.
These mechanisms are applicable to other objective functions beyond (4). We could also use the same training set but construct the weight matrix in (4) differently (e.g. using different numbers of positive and negative neighbors).
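Several binary hashing objective functions that differ in the general case of $b > 1$ bits become essentially identical in the $b = 1$ case. For example, expanding the pairwise terms in (1)–(3) (noting that $z_n^2 = 1$ if $z_n \in \{-1,+1\}$):
$$\text{KSH:}\quad (z_n z_m - y_{nm})^2 = -2\,y_{nm}\,z_n z_m + \text{constant}$$
$$\text{BRE:}\quad \big((z_n - z_m)^2 - y_{nm}\big)^2 = -4\,(2 - y_{nm})\,z_n z_m + \text{constant}$$
$$\text{LAP:}\quad y_{nm}\,(z_n - z_m)^2 = -2\,y_{nm}\,z_n z_m + \text{constant}.$$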
So the Laplacian and KSH objectives are in fact identical, and all three can be written in the form of a binary quadratic function without linear term (or a Markov random field with quadratic potentials only):
$$\min_{\mathbf{z}} E(\mathbf{z}) = \mathbf{z}^T \mathbf{A}\,\mathbf{z} \quad \text{with } \mathbf{z} \in \{-1,+1\}^N \tag{5}$$
with an appropriate, data-dependent, symmetric neighborhood matrix $\mathbf{A}$ of $N \times N$. This problem is NP-complete in general (Garey and Johnson, 1979; Boros and Hammer, 2002; Kolmogorov and Zabih, 2003), when $\mathbf{A}$ has both positive and negative elements, as well as zeros. It is submodular if $\mathbf{A}$ has only nonpositive elements, in which case it is equivalent to a min-cut/max-flow problem and it can be solved in polynomial time (Boros and Hammer, 2002; Greig et al., 1989).
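The claimed equivalence can be checked by brute force on a toy problem. The sketch below assumes the single-bit KSH term $(z_n z_m - y_{nm})^2$ and Laplacian term $y_{nm}(z_n - z_m)^2$, and compares both with the quadratic form $-2\sum_{n,m} y_{nm} z_n z_m$ over every possible code; the small matrix `Y` is made up for illustration.

```python
from itertools import product

# Hypothetical symmetric affinities for 3 points (+1 similar, -1 dissimilar).
Y = [[0, 1, -1],
     [1, 0, -1],
     [-1, -1, 0]]
N = len(Y)
PAIRS = [(n, m) for n in range(N) for m in range(N)]


def ksh_1bit(z):
    """Single-bit KSH objective: sum of (z_n z_m - y_nm)^2."""
    return sum((z[n] * z[m] - Y[n][m]) ** 2 for n, m in PAIRS)


def laplacian_1bit(z):
    """Single-bit Laplacian objective: sum of y_nm (z_n - z_m)^2."""
    return sum(Y[n][m] * (z[n] - z[m]) ** 2 for n, m in PAIRS)


def quadratic(z):
    """Binary quadratic form z^T A z with A = -2 Y."""
    return sum(-2 * Y[n][m] * z[n] * z[m] for n, m in PAIRS)


all_codes = list(product((-1, 1), repeat=N))
# If each objective equals the quadratic form plus a constant,
# the set of differences over all codes has exactly one element.
ksh_offsets = {ksh_1bit(z) - quadratic(z) for z in all_codes}
lap_offsets = {laplacian_1bit(z) - quadratic(z) for z in all_codes}
print(ksh_offsets, lap_offsets)
```

Since both offsets are constant, the two objectives have the same minimizers as the quadratic form, and hence as each other.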
More generally, any objective function of a binary vector $\mathbf{z}$ that has the form $E(\mathbf{z}) = \sum_{n,m=1}^{N} f_{nm}(z_n, z_m)$ and which only depends on Hamming distances between bits $z_n$, $z_m$ can be written as $f_{nm}(z_n, z_m) = a_{nm} z_n z_m + b_{nm}$. Even more, an arbitrary function of 3 binary variables that depends only on their Hamming distances can be written as a quadratic function of the 3 variables. However, for 4 variables or more this is not generally true (see appendix A).
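The pairwise case follows from a one-line argument, sketched here with illustrative symbols $f_0$, $f_1$, $a$, $b$ that are not part of the original notation. A function of two bits $z_n, z_m \in \{-1,+1\}$ that depends only on their Hamming distance takes just two values: $f_0$ when the bits agree and $f_1$ when they differ. Since $z_n z_m$ equals $+1$ in the first case and $-1$ in the second, a linear function of $z_n z_m$ can match both values:

```latex
\begin{align*}
f(z_n, z_m) &= a\, z_n z_m + b, \\
a + b = f_0, \quad -a + b = f_1
\quad &\Longrightarrow \quad
a = \frac{f_0 - f_1}{2}, \qquad b = \frac{f_0 + f_1}{2}.
\end{align*}
```

Summing over all pairs then gives a quadratic function of $\mathbf{z}$ plus a constant, which is the form of a binary quadratic problem without linear term.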
Training the hash functions independently has some important advantages. First, training the $b$ functions can be parallelized perfectly. This is a speedup of one to two orders of magnitude for typical values of $b$ (32 to 200 in our experiments). Coupled objective functions such as KSH do not exhibit obvious parallelism, because they are trained with alternating optimization, which is inherently sequential.
Second, even on a single processor, $b$ binary optimizations over $N$ variables each are generally easier than one binary optimization over $bN$ variables. This is so because the search spaces contain $b\,2^N$ and $2^{bN}$ states, respectively, so enumeration is much faster in the independent case (even though it is still impractical). If using an approximate polynomial-time algorithm, the independent case is also faster if the runtime is superlinear in the number of variables: the asymptotic runtimes will be $\mathcal{O}(bN^{\alpha})$ and $\mathcal{O}((bN)^{\alpha})$ with $\alpha > 1$, respectively. This is the case for the best practical GraphCut (Boykov et al., 2001) and max-flow/min-cut algorithms (Cormen et al., 2009).
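The size of the gap is easy to underestimate, so here is a small arithmetic sketch. It assumes state counts of $b\,2^N$ (independent) versus $2^{bN}$ (coupled) and a hypothetical runtime exponent `alpha`; the toy values of `b` and `n` are chosen only to keep the numbers printable.

```python
def independent_states(b: int, n: int) -> int:
    """States enumerated when solving b separate problems over n bits each."""
    return b * 2**n


def coupled_states(b: int, n: int) -> int:
    """States enumerated when solving one problem over b*n bits."""
    return 2 ** (b * n)


def runtime_ratio(b: int, n: int, alpha: float) -> float:
    """Coupled over independent runtime, (b n)^alpha / (b n^alpha) = b^(alpha - 1)."""
    return (b * n) ** alpha / (b * n**alpha)


b, n = 4, 10  # toy sizes; real problems have thousands of points per bit
print(independent_states(b, n))   # 4 * 2^10
print(coupled_states(b, n))       # 2^40
print(runtime_ratio(b, n, 2))     # quadratic-time solver: ratio equals b
```

The ratio $b^{\alpha-1}$ does not depend on $N$, so for a superlinear solver the advantage of the independent case grows with the number of bits alone, before any parallelism is exploited.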
Third, the solution exhibits “nesting”, that is, to get the solution for $b+1$ bits we just need to take a solution with $b$ bits and add one more bit (as happens with PCA). This is unlike most methods based on a coupled objective function (such as KSH), where the solution for $b+1$ bits cannot be obtained by adding one more bit to the solution for $b$ bits; we have to solve for $b+1$ bits from scratch.
For ILHf, both the training and test time are lower than if using all features for each hash function. The test runtime for a query is smaller in proportion to the fraction of features that each hash function uses.
Selecting the number of bits (hash functions) to use has not received much attention in the binary hashing literature. The most obvious way to do this would be to maximize the precision on a test set over $b$ (cross-validation) subject to $b$ not exceeding a preset limit (so applying the hash function is fast with test queries). The nesting property of ILH makes this computationally easy: we simply keep adding bits until the test precision stabilizes or decreases, or until we reach the maximum $b$. We can still benefit from parallel processing: if several processors are available, we train one hash function on each in parallel and evaluate their precision, also in parallel. If we still need to increase $b$, we train another batch of hash functions, and so on.
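The stopping rule described above can be sketched as follows. The function `select_num_bits`, the tolerance `tol` and the precision values are all hypothetical; in practice each precision would come from evaluating the nested code (first $k$ bits) on held-out queries.

```python
def select_num_bits(precision_by_bits, max_bits, tol=1e-3):
    """Pick the number of bits by adding one bit at a time.

    precision_by_bits[k - 1] is the held-out precision using the first k bits.
    Stops when adding a bit improves precision by at most `tol` (it has
    stabilized or decreased) or when `max_bits` is reached.
    """
    chosen = 1
    best = precision_by_bits[0]
    limit = min(max_bits, len(precision_by_bits))
    for num_bits in range(2, limit + 1):
        precision = precision_by_bits[num_bits - 1]
        if precision - best <= tol:
            break
        chosen, best = num_bits, precision
    return chosen


# Made-up precision curve: rises, then levels off after 4 bits.
curve = [0.20, 0.30, 0.36, 0.38, 0.38, 0.37]
print(select_num_bits(curve, max_bits=6))
print(select_num_bits(curve, max_bits=3))
```

Because of nesting, evaluating the curve needs only one extra hash function per step, whereas a coupled method would have to retrain all bits for every candidate $b$.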
We use the following labeled datasets (all using the Euclidean distance in feature space): (1) CIFAR (Krizhevsky, 2009), a collection of labeled images. We use GIST features (Oliva and Torralba, 2001) from each image, and split the images into a training set and a test set. (2) Infinite MNIST (Loosli et al., 2007). Using elastic deformations of the original MNIST handwritten digit dataset, we generated a training set and a test set of labeled images. We represent each image by a vector of raw pixels. Appendix C contains experiments on additional datasets.
Because of the computational cost of affinity-based methods, previous work has used training sets limited to a few thousand points (Kulis and Darrell, 2009; Liu et al., 2012; Lin et al., 2013; Ge et al., 2014). Unless otherwise indicated, we train the hash functions on a subset of the points of the training set, and report precision and recall by searching for a test query on the entire dataset (the base set). As hash functions (for each bit), we use linear SVMs (trained with LIBLINEAR; Fan et al., 2008) and kernel SVMs (with basis functions centered at a random subset of training points).
We report precision and recall for the test set queries using as ground truth (set of true neighbors in original space) all the training points with the same label as the query. The retrieved set contains the $K$ nearest neighbors of the query point in the Hamming space. We report precision for different values of $K$ to test the robustness of different algorithms.
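The evaluation protocol can be sketched for a single query as below. The function name, the toy codes and the labels are hypothetical, and the brute-force sort stands in for whatever search structure is used in practice.

```python
def hamming(code_a, code_b):
    """Hamming distance between two equal-length binary codes."""
    return sum(a != b for a, b in zip(code_a, code_b))


def precision_recall_at_k(query_code, query_label, base_codes, base_labels, k):
    """Precision and recall of the k nearest base points in Hamming space.

    Ground truth: all base points sharing the query's label.
    """
    order = sorted(range(len(base_codes)),
                   key=lambda i: hamming(query_code, base_codes[i]))
    retrieved = order[:k]
    hits = sum(base_labels[i] == query_label for i in retrieved)
    relevant = sum(label == query_label for label in base_labels)
    precision = hits / k
    recall = hits / relevant if relevant else 0.0
    return precision, recall


base_codes = [(1, 1, 1), (1, 1, -1), (-1, -1, -1), (1, -1, -1)]
base_labels = ["cat", "cat", "dog", "cat"]
precision, recall = precision_recall_at_k((1, 1, 1), "cat", base_codes, base_labels, k=2)
print(precision, recall)
```

Averaging these two numbers over all test queries, for several values of `k`, gives curves of the kind reported in the experiments.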
To understand the effect of diversity, we evaluate the 3 mechanisms ILHi, ILHt and ILHf, and their combination ILHitf, over a range of number of bits $b$ (32 to 128) and training set sizes $N$. As baseline coupled objective, we use KSH (Liu et al., 2012) but using the same two-step training as ILH: first we find the codes using the alternating min-cut method described earlier (initialized from an all-ones code, and running one iteration of alternating min-cut) and then we fit the classifiers. This is faster and generally finds better optima than the original KSH optimization (Lin et al., 2014b). We denote it as KSHcut.
Fig. 1 shows the results. Clearly, the best diversity mechanism is ILHt, which works better than the other mechanisms, even when combined with them, and significantly better than KSHcut. We explain this as follows. Although all 3 mechanisms introduce diversity, ILHt has a distinct advantage (also over KSHcut): it effectively uses $b$ times as much training data, because each hash function has its own disjoint dataset. Using $bN$ training points in KSHcut would be orders of magnitude slower. ILHt is equal to or even better than the combined ILHitf because 1) since there is already enough diversity in ILHt, the extra diversity from ILHi and ILHf does not help; 2) ILHf uses less data (it discards features), which can hurt the precision; this is also seen in fig. 2 (panel 2). The precision of all methods saturates as $N$ increases; ILHt reaches nearly maximum precision with a comparatively small per-bit training set. In fact, if we continued to increase the per-bit training set size $N$ in ILHt, eventually all bits would use the same training set (containing all available data), diversity would disappear and the precision would drop drastically to that of a single bit. Practical image retrieval datasets are so large that this is unlikely to occur unless $N$ is very large (which would make the optimization too slow anyway).
Linear SVMs are very stable classifiers known to benefit less from ensembles than less stable classifiers such as decision trees or neural nets (Kuncheva, 2014). Remarkably, they strongly benefit from the ensemble in our case. This is because each hash function is solving a different classification problem (different output labels), so the resulting SVMs are in fact quite different from each other. The conclusions for kernel hash functions are similar. We tried two cases: all the hash functions using the same, common centers for the radial basis functions vs each hash function using its own, private centers. Nonlinear classifiers are less stable than linear ones. In our case they do not benefit much more from the diversity than linear SVMs do. They do achieve higher precision since they are more powerful models, particularly when using private centers.
Fig. 2 (panels 1–2) shows the results in ILHf of varying the number of features $d$ used by each hash function. Intuitively, very low $d$ is bad because each classifier receives too little information and will make near-random codes. Indeed, for low $d$ the precision is comparable to that of LSH (random projections) in fig. 2 (panel 4). Very high $d$ will also work badly because it would eliminate the diversity and drop to the precision of a single bit for $d = D$. This does not actually happen because there is an additional source of diversity: the randomization in the alternating min-cut iterations. This has an effect similar to that of ILHi, and indeed a comparable precision. The highest precision is achieved at an intermediate proportion of features for ILHf, indicating some redundancy in the features. When combined with the other diversity mechanisms (ILHitf, panel 2), the highest precision occurs for $d = D$, because diversity is already provided by the other mechanisms, and using more data is better.
Fig. 2 (panel 3) shows the results of constructing the $b$ training sets for ILHt as a random sample from the base set such that they are “bootstrapped” (sampled with replacement), “disjoint” (sampled without replacement) or “random” (sampled without replacement but reset for each bit, so the training sets may overlap). As expected, “disjoint” (closely followed by “random”) is consistently and significantly better than “bootstrapped” because it introduces more independence between the hash functions and learns from more data overall (since each hash function uses the same training set size $N$).
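The three sampling schemes differ only in how indices into the base set are drawn. A minimal sketch follows, with hypothetical function names and an explicit random generator so that runs are repeatable; it returns index lists rather than data.

```python
import random


def bootstrapped_sets(base_size, num_bits, set_size, rng):
    """One training set per bit, each sampled WITH replacement (may repeat points)."""
    return [[rng.randrange(base_size) for _ in range(set_size)]
            for _ in range(num_bits)]


def disjoint_sets(base_size, num_bits, set_size, rng):
    """One training set per bit, sampled without replacement across ALL bits."""
    if num_bits * set_size > base_size:
        raise ValueError("base set too small for disjoint training sets")
    pool = rng.sample(range(base_size), num_bits * set_size)
    return [pool[i * set_size:(i + 1) * set_size] for i in range(num_bits)]


def random_sets(base_size, num_bits, set_size, rng):
    """Without replacement within a bit, but reset for each bit (sets may overlap)."""
    return [rng.sample(range(base_size), set_size) for _ in range(num_bits)]


rng = random.Random(0)
disjoint = disjoint_sets(base_size=100, num_bits=4, set_size=10, rng=rng)
overlapping = random_sets(base_size=100, num_bits=4, set_size=10, rng=rng)
bootstrapped = bootstrapped_sets(base_size=100, num_bits=4, set_size=10, rng=rng)
print(len({i for s in disjoint for i in s}), "distinct points used by the disjoint sets")
```

Only the disjoint scheme guarantees that the number of distinct training points grows linearly with the number of bits, which matches the explanation given for its advantage.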
Fig. 2 (panel 4) shows the precision (in the test set) as a function of the number of bits $b$ for ILHt, where the solution for $b$ bits is obtained by adding a new bit to the solution for $b-1$. Since the hash functions obtained depend on the order in which we add the bits, we show 5 such orders (red curves). Remarkably, the precision increases nearly monotonically and continues increasing as more bits are added (note the prediction error in bagging ensembles typically levels off after around 25–50 decision trees; Kuncheva, 2014, p. 186). This is (at least partly) because the effective training set size is proportional to $b$. The variance in the precision decreases as $b$ increases. In contrast, for KSHcut the variance is larger and the precision barely increases beyond a certain number of bits. The higher variance for KSHcut is due to the fact that each $b$ value involves training from scratch and we can converge to a relatively different local optimum. As with ILHt, adding LSH random projections (again 5 curves for different orders) increases precision monotonically, but can only reach a low precision at best, since it lacks supervision. We also show the curve for thresholded PCA (tPCA), whose precision peaks and then decreases as more bits are added. A likely explanation is that high-order principal components essentially capture noise rather than signal, i.e., random variation in the data, and this produces random codes for those bits, which destroy neighborhood information. Bagging tPCA (Leng et al., 2014) does make tPCA improve monotonically with $b$, but the result is still far from competitive. The reason is that there is little diversity among the ensemble members, because the top principal components can be accurately estimated even from small samples. The result in fig. 2 uses tPCA ensembles where each member has 16 principal components, i.e., 16 bits. If using single-bit members, as with ILHt, the precision with more bits is barely better than with 1 bit.
Is the precision gap between KSH and ILHt due to an incomplete optimization of the KSH objective, or to bad local optima? We verified that 1) random perturbations of the KSHcut optimum lower the precision; 2) optimizing KSHcut using the ILHt codes as initialization (“KSHcut-ILHt” curve) increases the precision but it still remains far from that of ILHt. This confirms that the optimization algorithm is doing its job, and that the ILHt diversity mechanism is superior to coupling the hash functions in a joint objective.
The result of learning binary hashing is $b$ hash functions, represented by a matrix $\mathbf{W}$ of $b \times D$ real weights for linear SVMs, and a matrix $\mathbf{Z}$ of $N \times b$ binary ($\pm 1$) codes for the entire dataset. We define a measure of code orthogonality as follows. Define $b \times b$ matrices $\mathbf{C}_Z = \frac{1}{N}\mathbf{Z}^T\mathbf{Z}$ for the codes and $\mathbf{C}_W = \mathbf{W}\mathbf{W}^T$ for the weights (assuming normalized SVM weights). Each matrix $\mathbf{C}$ has entries in $[-1,1]$, equal to a normalized dot product of codes or weight vectors, and diagonal entries equal to $1$. (Note that any matrix $\mathbf{S}\mathbf{C}\mathbf{S}$ where $\mathbf{S}$ is diagonal with $\pm 1$ entries is equivalent, since flipping the sign of a hash function’s output does not alter the Hamming distances.) Perfect orthogonality happens when $\mathbf{C} = \mathbf{I}$, and is encouraged (explicitly or not) by many binary hashing methods.
Fig. 3 shows this for ILHt in CIFAR and Infinite MNIST. It plots $\mathbf{C}_Z$ and $\mathbf{C}_W$ as an image, as well as the histogram of the entries of $\mathbf{C}_Z$ and $\mathbf{C}_W$. The histograms also contain, as a control, the histogram corresponding to normalized dot products of random vectors (of dimension $N$ or $D$, respectively), which is known to tend to a delta function at 0 as the dimension grows. Although there is some tendency to orthogonality as the number of bits $b$ used increases, it is clear that, for both codes and weight vectors, the distribution of dot products is wide and far from strict orthogonality. Hence, enforcing orthogonality does not seem necessary to achieve good hash functions and codes.
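A minimal sketch of this orthogonality measure is given below, assuming the codes are stored as an $N \times b$ list of $\pm 1$ rows and the weights as a $b \times D$ list of rows that are normalized to unit length before taking dot products. The function names and toy matrices are illustrative.

```python
import math


def code_gram(Z):
    """b x b matrix of normalized dot products between bit columns of Z (N x b, +/-1)."""
    n, b = len(Z), len(Z[0])
    return [[sum(Z[i][p] * Z[i][q] for i in range(n)) / n for q in range(b)]
            for p in range(b)]


def weight_gram(W):
    """b x b matrix of cosines between the weight vectors (rows of W, b x D)."""
    unit = []
    for row in W:
        norm = math.sqrt(sum(w * w for w in row))
        unit.append([w / norm for w in row])
    return [[sum(u * v for u, v in zip(unit[p], unit[q])) for q in range(len(W))]
            for p in range(len(W))]


# Toy example: two bits that are exactly orthogonal over four points,
# and two weight vectors at a right angle.
Z = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
W = [[3.0, 4.0], [4.0, -3.0]]
C_Z = code_gram(Z)
C_W = weight_gram(W)
print(C_Z)
print(C_W)
```

Off-diagonal entries near zero indicate near-orthogonal bits; a histogram of those entries is what the control comparison with random vectors is applied to.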
[Figure 3: orthogonality of the codes and of the hash function weight vectors for ILHt on CIFAR and Infinite MNIST, shown as matrix images and as histograms of the matrix entries.]
We compare with both the original KSH (Liu et al., 2012) and its min-cut optimization KSHcut (Lin et al., 2014b), and a representative subset of affinity-based and unsupervised hashing methods: Supervised Binary Reconstructive Embeddings (BRE) (Kulis and Darrell, 2009), Supervised Self-Taught Hashing (STH) (Zhang et al., 2010), Spectral Hashing (SH) (Weiss et al., 2009), Iterative Quantization (ITQ) (Gong et al., 2013), Binary Autoencoder (BA) (Carreira-Perpiñán and Raziperchikolaei, 2015), thresholded PCA (tPCA), and Locality-Sensitive Hashing (LSH) (Andoni and Indyk, 2008). We create affinities $y_{nm}$ for all the affinity-based methods using the dataset labels. For each training point $\mathbf{x}_n$, we use as similar neighbors points with the same label as $\mathbf{x}_n$, and as dissimilar neighbors points chosen randomly among the points whose labels differ from that of $\mathbf{x}_n$. For all datasets, all the methods are trained using a subset of the training points. Given that KSHcut already performs well (Lin et al., 2014b) and that ILHt consistently outperforms it both in precision and runtime, we expect ILHt to be competitive with the state-of-the-art. Fig. 4 shows this is generally the case, particularly as the number of bits $b$ increases, when ILHt beats all other methods, which are not able to increase precision as much as ILHt does.
The runtime to train a single ILHt hash function (on a single processor) for CIFAR is as follows:
[Table: number of training points vs. training time in seconds.]
This is much faster than other affinity-based hashing methods (for example, in one configuration BRE did not converge after hours). KSHcut is among the faster methods. Its runtime per min-cut pass over a single bit is comparable to ours, but it needs $b$ sequential passes to complete just one alternating optimization iteration, while our $b$ functions can be trained in parallel.
ILHt achieves a remarkably high precision compared to a coupled KSH objective optimized with the same algorithm; the difference is that ILHt introduces diversity by feeding different data to independent hash functions rather than by jointly optimizing over them. It also compares well with state-of-the-art methods in precision/recall, being competitive if few bits are used and the clear winner as more bits are used, and it is very fast and embarrassingly parallel.
We have revealed for the first time a connection between supervised binary hashing and ensemble learning that could open the door to many new hashing algorithms. Although we have focused on a specific objective (Laplacian) and identified as particularly successful with it a specific diversity mechanism (disjoint training sets), other choices may be better depending on the application. The core idea we propose is the independent training of the hash functions via the introduction of diversity by means other than coupling terms in the objective or constraints. This may come as a surprise in the area of learning binary hashing, where most work in the last few years has focused on proposing complex objective functions that couple all hash functions and developing sophisticated optimization algorithms for them.
Another surprise is that orthogonality of the codes or hash functions seems unnecessary. ILHt creates codes and hash functions that do differ from each other but are far from being orthogonal, yet they achieve good precision that keeps growing as we add bits. Thus, introducing diversity through different training data seems a better mechanism to make hash functions differ than coupling the codes through an orthogonality constraint or otherwise. It is also far simpler and faster to train independent single-bit hash functions.
A final surprise is that the wide variety of affinity-based objective functions in the $b$-bit case reduces to a binary quadratic problem in the 1-bit case regardless of the form of the $b$-bit objective (as long as it depends on Hamming distances only). In this sense, there is a unique objective in the 1-bit case.
There has been a prior attempt to use bagging (bootstrapped samples) with truncated PCA (Leng et al., 2014). Our experiments show that, while this improves truncated PCA, it performs poorly in supervised binary hashing. This is because PCA is unsupervised and does not use the user-provided similarity information, which may disagree with Euclidean distances in image space; and because estimating principal components from samples has low diversity. Also, PCA is computationally simple and there is little gain by bagging it, unlike the far more difficult optimization of supervised binary hashing.
Some supervised binary hashing work (Liu et al., 2012; Wang et al., 2012) has proposed to learn the $b$ hash functions sequentially, where the $i$th function has an orthogonality-like constraint to force it to differ from the previous functions. Hence, this does not learn the functions independently and can be seen as a greedy optimization of a joint objective over all functions.
Binary hashing does differ from ensemble learning in one important point: the predictions of the $b$ classifiers (= $b$ hash functions) are not combined into a single prediction, but are instead concatenated into a binary vector (which can take $2^b$ possible values). The “labels” (the binary codes) for the “classifiers” (the hash functions) are unknown, and are implicitly or explicitly learned together with the hash functions themselves. This means that well-known error decompositions such as the error-ambiguity decomposition (Krogh and Vedelsby, 1995) and the bias-variance decomposition (Geman et al., 1992) do not apply. Also, the real goal of binary hashing is to do well in information retrieval measures such as precision and recall, but hash functions do not directly optimize this. A theoretical understanding of why diversity helps in learning binary hashing is an important topic of future work.
In this respect, there is also a relation with error-correcting output codes (ECOC) (Dietterich and Bakiri, 1995), an approach for multiclass classification. In ECOC, we represent each of the classes with a $b$-bit binary vector, ensuring that $b$ is large enough for the vectors to be sufficiently separated in Hamming distance. Each bit corresponds to partitioning the classes into two groups. We then train $b$ binary classifiers, such as decision trees. Given a test pattern, we output as class label the one closest in Hamming distance to the $b$-bit output of the $b$ classifiers. The redundant error-correcting codes allow for small errors in the individual classifiers and can improve performance. An ECOC can also be seen as an ensemble of classifiers where we manipulate the output targets (rather than the input features or training set) to obtain each classifier, and we apply majority vote on the final result (if the test output of a classifier is 1, then all classes associated with 1 for that classifier get a vote). The main benefit of ECOC seems to be in variance reduction, as in other ensemble methods (James and Hastie, 1998). Binary hashing can be seen as an ECOC with $N$ classes, one per training point, with the ECOC prediction for a test pattern (query) being the nearest-neighbor class codes in Hamming distance. However, unlike in ECOC, in binary hashing the codes are learned so that they preserve neighborhood relations between training points. Also, while ideally all codes should be different (since a collision makes two originally different patterns indistinguishable, which will degrade some searches), this is not guaranteed in binary hashing.
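The ECOC decoding step described above (pick the class whose codeword is closest in Hamming distance to the classifiers' output) can be sketched as follows. This is only an illustration: the function name `ecoc_decode` and the 6-bit codewords are hypothetical and are not taken from the paper.

```python
import numpy as np

def ecoc_decode(code_matrix, bits):
    """Return the index of the class whose codeword is nearest in Hamming distance.

    code_matrix: (num_classes, b) array of 0/1 codewords (hypothetical example).
    bits: length-b array with the 0/1 outputs of the b binary classifiers.
    """
    code_matrix = np.asarray(code_matrix)
    bits = np.asarray(bits)
    # Hamming distance from the classifier outputs to every class codeword.
    distances = np.sum(code_matrix != bits, axis=1)
    return int(np.argmin(distances))

# Four classes with 6-bit codewords; every pair of codewords differs in 4 bits,
# so a single classifier error can still be corrected.
codes = np.array([
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 1, 1],
])

# Codeword of class 2 with its second bit flipped by a classifier error.
noisy_output = [1, 0, 0, 0, 1, 1]
predicted = ecoc_decode(codes, noisy_output)
```

In binary hashing the analogous step is the Hamming nearest-neighbor search over the codes of the training points, with one "class" per point.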
A final, different example shows the important role of diversity, i.e., making the hash functions differ, in learning good hash functions. Some binary hashing methods optimize an objective essentially of the following form (Rastegari et al., 2015; Xia et al., 2015):
(6) 
Here the hash functions are given by the rows of a linear projection matrix. The idea is to force the projections to be as close as possible to binary values. The orthogonality constraint ensures that trivial solutions (which would make all hash functions equal) are not optimal. Remarkably, the objective function (6) contains no explicit information about neighborhood preservation (as in affinity-based loss functions) or reconstruction of the input (as in autoencoders). Although orthogonal projections preserve Euclidean distances, this is no longer true if only a few, binarized projections are kept. Yet this can produce good hash functions if initialized from PCA or ITQ, which did learn projections that try to reconstruct the inputs optimally, and a local optimum of the (NP-complete) objective (6) may not be far from that. Thus, it would appear that part of the success of these approaches relies on the constraint providing a form of diversity among the hash functions.
Much work in supervised binary hashing has focused on designing sophisticated objective functions of the hash functions that force them to compete with each other while trying to preserve neighborhood information. We have shown, surprisingly, that training hash functions independently is not just simpler, faster and parallel, but also can achieve better retrieval quality, as long as diversity is introduced into each hash function’s objective function. This establishes a connection with ensemble learning and allows one to borrow techniques from it. We showed that having each hash function optimize a Laplacian objective on a disjoint subset of the data works well, and facilitates selecting the number of bits to use. Although our evidence is mostly empirical, the intuition behind it is sound and in agreement with the many results (also mostly empirical) showing the power of ensemble classifiers. The ensemble learning perspective suggests many ideas for future work, such as pruning a large ensemble or using other diversity techniques. It may also be possible to characterize theoretically the performance in precision of binary hashing depending on the diversity of the hash functions.
In the main paper, we state that, in the single-bit case ($b = 1$), the Laplacian, KSH and BRE loss functions over the vector $\mathbf{z} \in \{-1,+1\}^N$ of binary codes for each data point can be written in the form of a binary quadratic function without linear term (or an MRF with quadratic potentials only):
(7) $\min_{\mathbf{z}} E(\mathbf{z}) = \mathbf{z}^T \mathbf{A} \mathbf{z} \quad \text{s.t.} \quad \mathbf{z} \in \{-1,+1\}^N$
with an appropriate, data-dependent neighborhood symmetric matrix $\mathbf{A}$ of $N \times N$. We can assume w.l.o.g. that $a_{nn} = 0$, i.e., the diagonal elements of $\mathbf{A}$ are zero, since any diagonal values simply add a constant to $E$.
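More generally, consider an arbitrary objective function $E$ of a binary vector $\mathbf{z} \in \{-1,+1\}^N$ that has the form $E(\mathbf{z}) = \sum_{n,m=1}^N f_{nm}(z_n, z_m)$ and which only depends on Hamming distances between bits $z_n$, $z_m$. This is the form of the affinity-based loss function used in many binary hashing papers, in the single-bit case. Each term of the function can be written as $f_{nm}(z_n, z_m) = a_{nm} z_n z_m + c_{nm}$. This fact, already noted by Lin et al. (2013), is because a function of 2 binary variables can take 4 different values, $f_{nm}(1,1)$, $f_{nm}(1,-1)$, $f_{nm}(-1,1)$ and $f_{nm}(-1,-1)$,
but if $f_{nm}$ only depends on the Hamming distance of $z_n$ and $z_m$ then we have $f_{nm}(1,1) = f_{nm}(-1,-1)$ and $f_{nm}(1,-1) = f_{nm}(-1,1)$. This can be achieved by $a_{nm} = \frac{1}{2}\big(f_{nm}(1,1) - f_{nm}(1,-1)\big)$ and $c_{nm} = \frac{1}{2}\big(f_{nm}(1,1) + f_{nm}(1,-1)\big)$, and the constant $c_{nm}$ can be ignored when optimizing.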
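The two-variable argument above can be checked numerically. The sketch below is an illustration only: `example_loss` is a made-up pairwise loss that depends only on whether the two bits agree, and the names `a` and `c` for the product coefficient and the constant are our own labels.

```python
import itertools

def quadratic_form_coeffs(f):
    """For f(z1, z2) on {-1,+1}^2 that depends only on the Hamming distance
    between z1 and z2, return (a, c) such that f(z1, z2) = a*z1*z2 + c."""
    same = f(1, 1)    # value at Hamming distance 0
    diff = f(1, -1)   # value at Hamming distance 1
    a = (same - diff) / 2
    c = (same + diff) / 2
    return a, c

def example_loss(z1, z2):
    # Hypothetical loss: small when the bits agree, large when they differ.
    return 0.3 if z1 == z2 else 2.0

a, c = quadratic_form_coeffs(example_loss)

# Largest deviation between the loss and its quadratic form over all 4 inputs.
max_error = max(
    abs(example_loss(z1, z2) - (a * z1 * z2 + c))
    for z1, z2 in itertools.product([-1, 1], repeat=2)
)
```

Since only the coefficient of the product term depends on the codes, dropping the constant leaves a purely quadratic objective in the bits.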
By a similar argument we can prove that an arbitrary function of 3 binary variables that depends only on their Hamming distances can be written as a quadratic function of the 3 variables.
However, this is not true in general. This can be seen by comparing the dimensions of the function spaces spanned by the arbitrary function and the quadratic function. Consider first a general quadratic function of $N$ binary variables $\mathbf{z} \in \{-1,+1\}^N$. We can always take its matrix of quadratic coefficients $\mathbf{A}$ to be symmetric (because $\mathbf{z}^T \mathbf{A} \mathbf{z} = \mathbf{z}^T \big(\tfrac{1}{2}(\mathbf{A} + \mathbf{A}^T)\big) \mathbf{z}$) and absorb its diagonal terms into the constant (because $z_n^2 = 1$), so the number of free parameters grows only quadratically with $N$. The vector of $2^N$ possible values of the function for all possible binary vectors is a linear function of these free parameters. Hence, the dimension of the space of all quadratic functions is at most the number of free parameters. Consider now an arbitrary function of $N$ binary variables that depends only on their Hamming distances. Although there are $N(N-1)/2$ Hamming distances $d(z_n, z_m)$, they are all determined just by the first $N-1$ distances $d(z_1, z_n)$ for $n = 2, \dots, N$. This is because, given $z_1$, the distance $d(z_1, z_n)$ determines $z_n$ for each $n$ and so the entire vector $\mathbf{z}$ and all the other distances. Also, given the distances $d(z_1, z_n)$ for $n = 2, \dots, N$, the value $z_1 = -1$ produces a vector whose bits are reversed from that produced by $z_1 = +1$, so both have the same Hamming distances. Hence, we have $N-1$ free binary variables (the values of $d(z_1, z_n)$ for $n = 2, \dots, N$), which determine the vector of possible values of the function for all possible binary vectors $\mathbf{z}$. Hence, the dimension of the space of all arbitrary functions of Hamming distances is $2^{N-1}$. Since $2^{N-1}$ grows exponentially in $N$ and exceeds the number of quadratic parameters once $N$ is large enough, the quadratic functions in general cannot represent all arbitrary binary functions of the Hamming distances using the same binary variables.
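The dimension comparison above can be explored numerically for small $N$. The sketch below makes one assumption that should be stated plainly: the quadratic family is taken to be a constant plus the pairwise products $z_n z_m$, with linear terms left out because a function of Hamming distances is unchanged when all bits are flipped. The function names are our own.

```python
import itertools
import numpy as np

def quadratic_dim(n):
    """Rank of the span of {1, z_i*z_j for i<j}, evaluated on all z in {-1,+1}^n.

    Assumption: linear terms are omitted (see the lead-in text)."""
    Z = np.array(list(itertools.product([-1, 1], repeat=n)))
    columns = [np.ones(len(Z))]
    columns += [Z[:, i] * Z[:, j] for i, j in itertools.combinations(range(n), 2)]
    return int(np.linalg.matrix_rank(np.column_stack(columns)))

def hamming_dim(n):
    """Dimension of the space of functions of the pairwise Hamming distances:
    one free value per setting of the n-1 distances to the first bit."""
    return 2 ** (n - 1)

# (N, dimension of quadratic family, dimension of Hamming-distance functions)
comparison = [(n, quadratic_dim(n), hamming_dim(n)) for n in range(2, 7)]
```

Under this assumption the two dimensions coincide for 2 and 3 variables, in agreement with the statements above, and the Hamming-distance space is strictly larger from 4 variables on.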
Finally, note that some objective functions which make sense in the $b$-bit case with $b > 1$ become trivial in the single-bit case. For example, the loss function for Minimal Loss Hashing (Norouzi and Fleet, 2011) uses a hinge loss to implement the goal that similar points should differ by no more than a threshold number of bits and dissimilar points should differ by at least a threshold number of bits, where the difference is measured by the Hamming distance between the codes of the two points. It is easy to see that in the single-bit case the loss becomes constant, independent of the codes, because using one bit the Hamming distance can be either 0 or 1 only.
In the paragraph “Are the codes orthogonal?” of the main paper, we define a measure of orthogonality for either the binary codes or the hash function weight vectors, based on their respective $b \times b$ matrices of normalized dot products (where, for the weight vectors, the rows of the weight matrix are normalized). Here we prove several statements we make in that paragraph.
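Given a $b \times b$ matrix $\mathbf{C}$ of normalized dot products (for either the binary codes or the hash function weight vectors) with entries in $[-1,1]$, define as measure of orthogonality (where $\|\cdot\|_F$ is the Frobenius norm):
(8) $\perp(\mathbf{C}) = \frac{1}{b(b-1)} \|\mathbf{C} - \mathbf{I}\|_F^2$
That is, $\perp(\mathbf{C})$ is the average of the squared off-diagonal elements of $\mathbf{C}$.
$\perp(\mathbf{C})$ is independent of sign reversals of the hash functions.
Proof: Let $\mathbf{S}$ be a $b \times b$ diagonal matrix with diagonal entries $\pm 1$, so that reversing the sign of some hash functions replaces $\mathbf{C}$ with $\mathbf{S}\mathbf{C}\mathbf{S}$. $\mathbf{S}$ satisfies $\mathbf{S}^T \mathbf{S} = \mathbf{S}^2 = \mathbf{I}$, so it is orthogonal. Hence, $\|\mathbf{S}\mathbf{C}\mathbf{S} - \mathbf{I}\|_F = \|\mathbf{S}(\mathbf{C} - \mathbf{I})\mathbf{S}\|_F = \|\mathbf{C} - \mathbf{I}\|_F$. ∎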
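The orthogonality measure (8), described in the text as the average of the squared off-diagonal elements, and its invariance to sign reversals can be illustrated numerically. This is a minimal sketch: the function name `orthogonality`, the matrix sizes and the random weight vectors are hypothetical choices, not the paper's data.

```python
import numpy as np

def orthogonality(C):
    """Average of the squared off-diagonal elements of a square matrix C."""
    b = C.shape[0]
    off_diagonal = C - np.diag(np.diag(C))
    return float(np.sum(off_diagonal ** 2) / (b * (b - 1)))

rng = np.random.default_rng(0)

# Hypothetical weight vectors: 8 hash functions in 32 dimensions, rows normalized.
W = rng.standard_normal((8, 32))
W /= np.linalg.norm(W, axis=1, keepdims=True)
C = W @ W.T  # matrix of normalized dot products, with unit diagonal

# Reverse the sign of a random subset of the hash functions.
S = np.diag(rng.choice([-1.0, 1.0], size=8))
C_flipped = S @ C @ S

original_value = orthogonality(C)
flipped_value = orthogonality(C_flipped)
```

A sign reversal only changes the sign of some off-diagonal entries, so their squares, and hence the measure, are unchanged.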
As control hypothesis for the orthogonality of the binary codes or hash function vectors we used the distribution of dot products of random vectors. Here we give their mean and variance explicitly as a function of their dimension.
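Let $\mathbf{x}, \mathbf{y} \in \{-1,+1\}^d$ be two random binary vectors of $d$ independent components, where each component takes the value $+1$ or $-1$ with probability $\frac{1}{2}$. Let $z = \frac{1}{d} \mathbf{x}^T \mathbf{y} = \frac{1}{d} \sum_{i=1}^d x_i y_i$. Then $\mathrm{E}\{z\} = 0$ and $\mathrm{var}\{z\} = \frac{1}{d}$.
Proof: Let $u_i = x_i y_i$. Clearly, $u_i$ takes the value $\pm 1$ with probability $\frac{1}{2}$, so its mean is $0$ and its variance is $1$, and $u_1, \dots, u_d$ are iid. Hence, using standard properties of the expectation and variance, we have that $\mathrm{E}\{z\} = \frac{1}{d} \sum_{i=1}^d \mathrm{E}\{u_i\} = 0$ and $\mathrm{var}\{z\} = \frac{1}{d^2} \sum_{i=1}^d \mathrm{var}\{u_i\} = \frac{1}{d}$. (Furthermore, $\frac{1}{2}(u_i + 1)$ is Bernoulli and $\frac{1}{2} d (z + 1)$ is binomial.) ∎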
It is also possible to prove that, for random unit vectors of dimension $d$ with real components, their dot product has mean $0$ and variance $\frac{1}{d}$.
Hence, as the dimension increases, the variance decreases, and the distribution of the dot product tends to a delta at $0$. This means that random high-dimensional vectors are practically orthogonal. The “random” histograms (black line) in fig. 3 are based on a sample of random vectors (for the weight vectors, we sample the components of each vector uniformly and then normalize the vector). They follow the theoretical distribution well.
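The mean and variance of random dot products stated above can be checked with a small Monte Carlo simulation. This sketch is an illustration under stated assumptions: the dimension, sample size and seed are arbitrary, and the real-valued unit vectors are drawn by normalizing Gaussian samples (uniform on the sphere), which is not necessarily the sampling scheme used for the paper's histograms.

```python
import numpy as np

def dot_product_stats(d, num_pairs=20000, seed=0):
    """Empirical (mean, variance) of normalized dot products of random vector pairs."""
    rng = np.random.default_rng(seed)

    # Random binary vectors with independent +/-1 components; dot product divided by d.
    x = rng.choice([-1, 1], size=(num_pairs, d))
    y = rng.choice([-1, 1], size=(num_pairs, d))
    z_binary = np.sum(x * y, axis=1) / d

    # Random real unit vectors (assumption: normalized Gaussian samples).
    u = rng.standard_normal((num_pairs, d))
    v = rng.standard_normal((num_pairs, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    z_real = np.sum(u * v, axis=1)

    return {
        "binary": (float(z_binary.mean()), float(z_binary.var())),
        "real": (float(z_real.mean()), float(z_real.var())),
    }

d = 64
stats = dot_product_stats(d)
theoretical_variance = 1.0 / d
```

Increasing `d` shrinks both empirical variances towards zero, which is the sense in which random high-dimensional vectors are practically orthogonal.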
In fig. 5 we also include results for an additional, unsupervised dataset, the Flickr 1 million image dataset (Huiskes et al., 2010). For Flickr, we randomly select a subset of images for test and use the rest for training. We use MPEG-7 edge histogram features. Since no labels are available, we create pseudolabels for each training point by declaring as similar points its true nearest neighbors (using the Euclidean distance) and as dissimilar points a random subset of the remaining points. As ground truth, we use the nearest neighbors of the query in Euclidean space. All hash functions are trained using points. Retrieved set: the $k$ nearest neighbors of the query point in the Hamming space, for a range of $k$.
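The pseudolabel construction described above (nearest neighbors in Euclidean distance as similar points, a random subset of the remaining points as dissimilar points) can be sketched as follows. The function name, the parameters `num_similar` and `num_dissimilar`, and the toy data are all hypothetical; the actual neighborhood sizes used in the experiments are not reproduced here.

```python
import numpy as np

def make_pseudolabels(X, n, num_similar, num_dissimilar, rng):
    """Return (similar, dissimilar) index arrays for point n of the data matrix X.

    similar: indices of the num_similar Euclidean nearest neighbors of X[n].
    dissimilar: a random subset of num_dissimilar indices among the remaining points.
    """
    dists = np.linalg.norm(X - X[n], axis=1)
    dists[n] = np.inf                      # push the point itself to the end
    order = np.argsort(dists)
    similar = order[:num_similar]
    remaining = order[num_similar:-1]      # drop the neighbors and the point itself
    dissimilar = rng.choice(remaining, size=num_dissimilar, replace=False)
    return similar, dissimilar

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))           # toy data: 50 points in 5 dimensions
similar, dissimilar = make_pseudolabels(X, n=3, num_similar=4, num_dissimilar=6, rng=rng)
```

For a large dataset one would replace the exhaustive distance computation with a batched or indexed nearest-neighbor search; the brute-force version is used here only to keep the sketch short.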
The only important difference is that Locality-Sensitive Hashing (LSH) achieves a high precision in the Flickr dataset, considerably higher than that of KSHcut. This is understandable, for the following reasons: 1) Flickr is an unsupervised dataset, and the neighborhood information provided to KSHcut (and ILHt) in the form of affinities is limited to the small subset of positive and negative neighbors of each point, while LSH has access to the full feature vector of every image. 2) The dimensionality of the Flickr feature vectors is quite small. Still, ILHt beats LSH by a significant margin.
In addition to the methods we used in the supervised datasets, we compare ILHt with Spectral Hashing (SH) (Weiss et al., 2009), Iterative Quantization (ITQ) (Gong et al., 2013), Binary Autoencoder (BA) (Carreira-Perpiñán and Raziperchikolaei, 2015), thresholded PCA (tPCA), and Locality-Sensitive Hashing (LSH) (Andoni and Indyk, 2008). Again, ILHt beats all other state-of-the-art methods, or is comparable to the best of them, particularly as the number of bits increases.
Work supported by NSF award IIS–1423515.
Proc. of the 2015 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’15), pages 557–566, Boston, MA, June 7–12 2015.
Proc. of the 17th Int. Conf. Artificial Intelligence and Statistics (AISTATS 2014), pages 10–19, Reykjavik, Iceland, Apr. 22–25 2014.
Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, B, 51(2):271–279, 1989.
Neural network ensembles, cross validation, and active learning. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems (NIPS), volume 7, pages 231–238. MIT Press, Cambridge, MA, 1995.
Fast supervised hashing with decision trees for high-dimensional data. In Proc. of the 2014 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’14), pages 1971–1978, Columbus, OH, June 23–28 2014b.
Training invariant support vector machines using selective sampling. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, Neural Information Processing Series, pages 301–320. MIT Press, 2007.
Multiclass spectral clustering. In Proc. 9th Int. Conf. Computer Vision (ICCV’03), pages 313–319, Nice, France, Oct. 14–17 2003.