An ensemble diversity approach to supervised binary hashing

02/04/2016 ∙ by Miguel Á. Carreira-Perpiñán, et al.

Binary hashing is a well-known approach for fast approximate nearest-neighbor search in information retrieval. Much work has focused on affinity-based objective functions involving the hash functions or binary codes. These objective functions encode neighborhood information between data points and are often inspired by manifold learning algorithms. They ensure that the hash functions differ from each other through constraints or penalty terms that encourage codes to be orthogonal or dissimilar across bits, but this couples the binary variables and complicates the already difficult optimization. We propose a much simpler approach: we train each hash function (or bit) independently from each other, but introduce diversity among them using techniques from classifier ensembles. Surprisingly, we find that not only is this faster and trivially parallelizable, but it also improves over the more complex, coupled objective function, and achieves state-of-the-art precision and recall in experiments with image retrieval.


1 Introduction and related work

Information retrieval tasks such as searching for a query image or document in a database are essentially a nearest-neighbor search (Shakhnarovich et al., 2006). When the dimensionality of the query and the size of the database are large, approximate search is necessary. We focus on binary hashing (Grauman and Fergus, 2013), where the query and database are mapped onto low-dimensional binary vectors and the search is performed on those vectors. This gives two speedups: computing Hamming distances (with hardware support) is much faster than computing distances between high-dimensional floating-point vectors; and the entire database becomes much smaller, so it may reside in fast memory rather than disk (for example, a database of 1 billion real vectors of dimension 500 takes 2 TB in floating point but 8 GB as 64-bit codes).
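As a rough illustration of the two speedups, the sketch below packs binary codes into bytes and computes Hamming distances with XOR plus a popcount lookup table. It uses NumPy only; the helper names `pack_codes` and `hamming_to_query` are hypothetical, and the storage figures assume 4-byte floats, which is what the 2 TB figure in the text implies.

```python
import numpy as np

def pack_codes(bits):
    """Pack an (N, b) array of 0/1 bits into (N, ceil(b/8)) bytes."""
    return np.packbits(bits.astype(np.uint8), axis=1)

# Lookup table: number of set bits in every possible byte value.
POPCOUNT = np.array([bin(v).count("1") for v in range(256)], dtype=np.uint8)

def hamming_to_query(packed_db, packed_query):
    """Hamming distance from one packed query code to every packed database code."""
    xor = np.bitwise_xor(packed_db, packed_query)  # bytes whose set bits mark disagreements
    return POPCOUNT[xor].sum(axis=1, dtype=np.int64)

rng = np.random.default_rng(0)
db_bits = rng.integers(0, 2, size=(1000, 64))      # toy database of 64-bit codes
query_bits = db_bits[17].copy()
query_bits[:3] ^= 1                                # query = item 17 with 3 bits flipped

dist = hamming_to_query(pack_codes(db_bits), pack_codes(query_bits[None, :]))
nearest = np.argsort(dist, kind="stable")[:5]      # 5 nearest codes in Hamming space

# Storage arithmetic from the text (assuming 4-byte floats).
float_bytes = 10**9 * 500 * 4                      # 2e12 bytes = 2 TB
code_bytes = 10**9 * 64 // 8                       # 8e9 bytes = 8 GB
```

A production system would more likely use 64-bit words and a hardware popcount instruction; the byte lookup table is just the simplest portable way to show the idea.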

Constructing hash functions that do well in retrieval measures such as precision and recall is usually done by optimizing an affinity-based objective function that relates Hamming distances to supervised neighborhood information in a training set. Many such objective functions have the form of a sum of pairwise terms that indicate whether the training points $\mathbf{x}_n$ and $\mathbf{x}_m$ are neighbors:

$$\min_{\mathbf{h}} \mathcal{L}(\mathbf{h}) = \sum_{n,m=1}^{N} L(\mathbf{z}_n, \mathbf{z}_m; y_{nm}) \quad \text{where } \mathbf{z}_n = \mathbf{h}(\mathbf{x}_n),\ \mathbf{z}_m = \mathbf{h}(\mathbf{x}_m).$$

Here, $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_N)$ is the dataset of high-dimensional feature vectors (e.g., SIFT features of an image), $\mathbf{h}\colon \mathbb{R}^D \to \{-1,+1\}^b$ are $b$ binary hash functions and $\mathbf{z} = \mathbf{h}(\mathbf{x})$ is the $b$-bit code vector for input $\mathbf{x} \in \mathbb{R}^D$, $\min_{\mathbf{h}}$ means minimizing over the parameters of the hash function $\mathbf{h}$ (e.g. over the weights of a linear SVM), and $L(\cdot)$ is a loss function that compares the codes for two images (often through their Hamming distance $\|\mathbf{z}_n - \mathbf{z}_m\|$) with the ground-truth value $y_{nm}$ that measures the affinity in the original space between the two images $\mathbf{x}_n$ and $\mathbf{x}_m$ (distance, similarity or other measure of neighborhood). The sum is often restricted to a subset of image pairs $(n,m)$ (for example, within the $k$ nearest neighbors of each other in the original space), to keep the runtime low. The output of the algorithm is the hash function $\mathbf{h}$ and the binary codes $\mathbf{z}_1, \dots, \mathbf{z}_N$ for the training points, where $\mathbf{z}_n = \mathbf{h}(\mathbf{x}_n)$ for $n = 1, \dots, N$. Examples of these objective functions are Supervised Hashing with Kernels (KSH) (Liu et al., 2012), Binary Reconstructive Embeddings (BRE) (Kulis and Darrell, 2009) and the binary Laplacian loss (an extension of the Laplacian Eigenmaps objective; Belkin and Niyogi, 2003):

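$$\begin{aligned}
\text{KSH:}\quad & L(\mathbf{z}_n, \mathbf{z}_m; y_{nm}) = (\mathbf{z}_n^T \mathbf{z}_m - b\, y_{nm})^2 && (1)\\
\text{BRE:}\quad & L(\mathbf{z}_n, \mathbf{z}_m; y_{nm}) = \big(\tfrac{1}{b} \|\mathbf{z}_n - \mathbf{z}_m\|^2 - y_{nm}\big)^2 && (2)\\
\text{LAP:}\quad & L(\mathbf{z}_n, \mathbf{z}_m; y_{nm}) = y_{nm} \|\mathbf{z}_n - \mathbf{z}_m\|^2 && (3)
\end{aligned}$$

where for KSH $y_{nm}$ is $1$ if $\mathbf{x}_n$, $\mathbf{x}_m$ are similar and $-1$ if they are dissimilar; for BRE $y_{nm} = \frac{1}{2}\|\mathbf{x}_n - \mathbf{x}_m\|^2$ (where the dataset $\mathbf{X}$ is scaled or normalized so the Euclidean distances are in $[0,1]$); and for the Laplacian loss $y_{nm} > 0$ if $\mathbf{x}_n$, $\mathbf{x}_m$ are similar and $y_{nm} < 0$ if they are dissimilar (“positive” and “negative” neighbors). Other examples of these objective functions include models developed for dimension reduction, be they spectral such as Locally Linear Embedding (Roweis and Saul, 2000) or Anchor Graphs (Liu et al., 2011), or nonlinear such as the Elastic Embedding (Carreira-Perpiñán, 2010) or $t$-SNE (van der Maaten and Hinton, 2008); as well as objective functions designed specifically for binary hashing, such as Semi-supervised sequential Projection Learning Hashing (SPLH) (Wang et al., 2012). They all can produce good hash functions. We will focus on the Laplacian loss in this paper.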
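To make the three pairwise losses concrete, here is a minimal sketch. It assumes codes in $\{-1,+1\}^b$ and the commonly used forms of the losses: KSH as $(\mathbf{z}_n^T\mathbf{z}_m - b\,y_{nm})^2$, BRE as $(\frac{1}{b}\|\mathbf{z}_n - \mathbf{z}_m\|^2 - y_{nm})^2$ and the Laplacian loss as $y_{nm}\|\mathbf{z}_n - \mathbf{z}_m\|^2$. The function names are illustrative, not taken from any library.

```python
import numpy as np

def ksh_loss(zn, zm, y, b):
    """KSH pairwise term: squared gap between the code dot product and b * y."""
    return (zn @ zm - b * y) ** 2

def bre_loss(zn, zm, y, b):
    """BRE pairwise term: squared gap between the scaled code distance and y."""
    return (np.sum((zn - zm) ** 2) / b - y) ** 2

def laplacian_loss(zn, zm, y):
    """Laplacian pairwise term: affinity-weighted squared code distance."""
    return y * np.sum((zn - zm) ** 2)

def total_loss(Z, Y, pair_loss):
    """Sum a pairwise loss over all ordered pairs (n, m) of rows of Z."""
    N = Z.shape[0]
    return sum(pair_loss(Z[n], Z[m], Y[n, m]) for n in range(N) for m in range(N))

# Two 4-bit codes that differ in exactly one bit.
zn = np.array([1, 1, -1, 1])
zm = np.array([1, -1, -1, 1])
b = 4

ksh_value = ksh_loss(zn, zm, y=1, b=b)       # (2 - 4)^2
bre_value = bre_loss(zn, zm, y=0.5, b=b)     # (4/4 - 0.5)^2
lap_value = laplacian_loss(zn, zm, y=1)      # 1 * 4
```

Note that with $\pm 1$ codes each differing bit contributes 4 to $\|\mathbf{z}_n - \mathbf{z}_m\|^2$, so the squared Euclidean distance is four times the Hamming distance.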

In designing these objective functions, one needs to eliminate two types of trivial solutions. 1) In the Laplacian loss, mapping all points to the same code, i.e., $\mathbf{z}_1 = \dots = \mathbf{z}_N$, is the global optimum of the positive neighbors term (this also arises if the codes are real-valued, as in Laplacian eigenmaps). This can be avoided by having negative neighbors. 2) Having all hash functions (all bits of each vector) identical to each other, i.e., $z_{n1} = \dots = z_{nb}$ for each $n = 1, \dots, N$. This can be avoided by introducing constraints, penalty terms or other mathematical devices that couple the $b$ bit dimensions. For example, in the Laplacian loss (3) we can encourage codes to be orthogonal through a constraint (Weiss et al., 2009) or a penalty term (Ge et al., 2014), the latter requiring a hyperparameter that controls the weight of the penalty, although this generates dense matrices of $N \times N$. In the KSH or BRE losses (1)–(2), squaring the dot product or Hamming distance between the codes couples the $b$ bits.

An important downside of these approaches is the difficulty of their optimization. This is because the objective function is nonsmooth (implicitly discrete) owing to the binary output of the hash function. There is a large number of such binary variables ($bN$), a larger number of pairwise interactions ($\mathcal{O}(N^2)$, less if using sparse neighborhoods) and the variables are coupled by the said constraints or penalty terms. The optimization is approximated in different ways. Most papers ignore the binary nature of the codes and optimize over them as real values, then binarize them by truncation (possibly with an optimal rotation; Yu and Shi, 2003; Gong et al., 2013), and finally fit a classifier (e.g. linear SVM) to each of the $b$ bits separately. For example, for the Laplacian loss with constraints this involves solving an eigenproblem as in Laplacian eigenmaps (Belkin and Niyogi, 2003; Weiss et al., 2009; Zhang et al., 2010), or an approximation of it using landmarks (Liu et al., 2011). This is fast, but relaxing the codes in the optimization is generally far from optimal. Some recent papers try to respect the binary nature of the codes during their optimization, using techniques such as alternating optimization, min-cut and GraphCut (Boykov et al., 2001; Lin et al., 2014b; Ge et al., 2014) or others (Lin et al., 2013), and then fit the classifiers, or use alternating optimization directly on the hash function parameters (Liu et al., 2012). Even more recently, one can optimize jointly over the binary codes and hash functions (Ge et al., 2014; Carreira-Perpiñán and Raziperchikolaei, 2015; Raziperchikolaei and Carreira-Perpiñán, 2015). Most of these approaches are slow and limited to small datasets (a few thousand points) because of the quadratic number of pairwise terms in the objective.

We propose a different, much simpler approach. Rather than coupling the hash functions into a single objective function, we train each hash function independently from each other and using a single-bit objective function of the same form. We show that we can avoid trivial solutions by injecting diversity into each hash function’s training using techniques inspired from classifier ensemble learning. Section 2 discusses relevant ideas from the ensemble learning literature, section 3 describes our independent Laplacian hashing algorithm, section 4 gives evidence with image retrieval datasets that this simple approach indeed works very well, and section 5 further discusses the connection between hashing and ensembles.

2 Ideas from learning classifier ensembles

At first sight, optimizing (3) without constraints does not seem like a good idea: since the objective separates over the $b$ bits, we obtain $b$ independent, identical objectives, one over each hash function, and so they all have the same global optimum. And, if all hash functions are equal, they are equivalent to using just one of them, which will give a much lower precision/recall. In fact, the very same issue arises when training an ensemble of classifiers (Dietterich, 2000; Zhou, 2012; Kuncheva, 2014). Here, we have a training set of input vectors and output class labels, and want to train several classifiers whose outputs are then combined (usually by majority vote). If the classifiers are all equal, we gain nothing over a single classifier. Hence, it is necessary to introduce diversity among the classifiers so that they disagree in their predictions. The ensemble learning literature has identified several mechanisms to inject diversity. The most important ones that apply to our binary hashing setting are as follows:

Using different data for each classifier

This can be done by: 1) Using different feature subsets for each classifier. This works best if the features are somewhat redundant. 2) Using different training sets for each classifier. This works best for unstable algorithms (whose resulting classifier is sensitive to small changes in the training data), such as decision trees or neural nets, and less well for stable ones such as linear or nearest-neighbor classifiers. A prominent example is bagging (Breiman, 1996), which generates bootstrap datasets and trains a model on each.

Injecting randomness in the training algorithm

This is only possible if local optima exist (as for neural nets) or if the algorithm is randomized (as for decision trees). This can be done by using different initializations, adding noise to the updates or using different choices in the randomized operations in the algorithm (e.g. the choice of split in decision trees, as in random forests; Breiman, 2001).

Using different classifier models

For example, different parameters (e.g. the number of neighbors in a nearest-neighbor classifier), different architectures (e.g. neural nets with different number of layers or hidden units), or different types of classifiers altogether.

There are other variations in addition to these techniques, as well as combinations of them.

3 Independent Laplacian Hashing (ILH) with diversity

The connection of binary hashing with ensemble learning offers many possible options, in terms of the choice of type of hash function (“base learner”), binary hashing (single-bit) objective function, optimization algorithm, and diversity mechanism. In this paper we focus on the following choices. We use linear and kernel SVMs as hash functions. Without loss of generality (see later), we use the Laplacian objective (3), which for a single bit takes the form

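$$E(\mathbf{z}) = \sum_{n,m=1}^{N} y_{nm} (z_n - z_m)^2, \qquad z_n = h(\mathbf{x}_n) \in \{-1,+1\},\ n = 1, \dots, N. \qquad (4)$$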

To optimize it, we use a two-step approach, where we first optimize (4) over the $N$ bits and then learn the hash function by fitting a binary classifier to them. (It is also possible to optimize over the hash function directly with the method of auxiliary coordinates; Carreira-Perpiñán and Raziperchikolaei, 2015; Raziperchikolaei and Carreira-Perpiñán, 2015, which essentially iterates over optimizing (4) and fitting the classifier.) The Laplacian objective (4) is NP-complete if we have negative neighbors (i.e., some $y_{nm} < 0$). We approximately optimize it using a min-cut algorithm (as implemented by Boykov et al., 2001) applied in alternating fashion to submodular blocks, as described in Lin et al. (2014a). This first partitions the $N$ points into disjoint groups containing only nonnegative weights. Each group defines a submodular function (specifically, quadratic with nonpositive coefficients) whose global minimum can be found in polynomial time using min-cut. The order in which the groups are optimized over is randomized at each iteration (this improves over using a fixed order). The approximate optimizer found depends on the initial code vector $\mathbf{z}$.
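The sketch below evaluates a single-bit Laplacian objective of the form $E(\mathbf{z}) = \sum_{n,m} y_{nm}(z_n - z_m)^2$ and minimizes it approximately. Note the swap: it uses greedy single-bit flips as a deliberately simple stand-in, not the alternating min-cut procedure described above, and the function names are hypothetical. It does show how the result depends on the initial code vector, and it covers only the first step (finding codes), not the classifier fit.

```python
import numpy as np

def laplacian_energy(z, Y):
    """E(z) = sum over n, m of y_nm * (z_n - z_m)^2 for z in {-1, +1}^N."""
    diff = z[:, None] - z[None, :]
    return float(np.sum(Y * diff ** 2))

def greedy_bit_flips(Y, z0, max_sweeps=50):
    """Local search: flip one bit at a time whenever that lowers E(z).

    Stand-in for alternating min-cut; it only guarantees a local optimum.
    """
    z = z0.copy()
    S = Y + Y.T                       # z_n z_m appears with weight y_nm + y_mn
    np.fill_diagonal(S, 0.0)          # diagonal terms are constant in z
    for _ in range(max_sweeps):
        changed = False
        for n in range(len(z)):
            # Flipping z_n changes E by 4 * z_n * sum_m S[n, m] * z_m.
            delta = 4.0 * z[n] * (S[n] @ z)
            if delta < 0:
                z[n] = -z[n]
                changed = True
        if not changed:
            break
    return z

# Toy problem: two classes of 3 points; +1 affinity within a class, -1 across.
labels = np.array([0, 0, 0, 1, 1, 1])
Y = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)
np.fill_diagonal(Y, 0.0)

z0 = np.random.default_rng(0).choice(np.array([-1, 1]), size=6)   # random initial code
z = greedy_bit_flips(Y, z0)
energy = laplacian_energy(z, Y)
```

On this toy problem the local search separates the two classes into opposite bit values. On realistic affinity matrices it can stop in poor local optima, which is one reason to prefer a stronger optimizer such as min-cut on submodular blocks.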

Finally, we consider three types of diversity mechanism (as well as their combination):

Different initializations (ILHi)

Each hash function is initialized from a random $N$-bit vector $\mathbf{z}$.

Different training sets (ILHt)

Each hash function uses a training set of $N$ points that is different and (if possible) disjoint from that of other hash functions. We can afford to do this because in binary hashing the available training sets are potentially very large, while the computational cost of the optimization limits each training set to a few thousand points. Later we show this outperforms using bootstrapped training sets.

Different feature subsets (ILHf)

Each hash function is trained on a random subset of $d$ of the $D$ features, sampled without replacement (so the features are distinct). The subsets corresponding to different hash functions may overlap.

These mechanisms are applicable to other objective functions beyond (4). We could also use the same training set but construct the weight matrix in (4) differently (e.g. using different numbers of positive and negative neighbors).
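The three diversity mechanisms amount to different ways of sampling what each hash function sees. The following sketch generates, for each of $b$ bits, a disjoint training set (ILHt), a random feature subset (ILHf) and a random initial code vector (ILHi). The function name, argument names and the toy sizes are all hypothetical, and the actual training of each hash function is left out.

```python
import numpy as np

def ilh_diversity_plan(n_base, n_features, b, N, d, seed=0):
    """Return per-bit training indices, feature indices and initial codes.

    Illustrative arguments: n_base = size of the base set, b = number of
    hash functions, N = points per hash function, d = features per hash function.
    """
    if b * N > n_base:
        raise ValueError("disjoint training sets need b * N <= n_base")
    rng = np.random.default_rng(seed)
    # ILHt: split a random permutation into b disjoint blocks of N points.
    train_sets = rng.permutation(n_base)[: b * N].reshape(b, N)
    # ILHf: each bit draws d distinct features; subsets of different bits may overlap.
    feature_sets = np.stack(
        [rng.choice(n_features, size=d, replace=False) for _ in range(b)]
    )
    # ILHi: each bit starts its code optimization from a random N-bit vector.
    init_codes = rng.choice(np.array([-1, 1]), size=(b, N))
    return train_sets, feature_sets, init_codes

train_sets, feature_sets, init_codes = ilh_diversity_plan(
    n_base=10_000, n_features=320, b=8, N=500, d=100
)
```

Row `i` of each array would then drive the independent training of hash function `i`, so the `b` trainings can run in any order or in parallel.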

Equivalence of objective functions in the single-bit case

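Several binary hashing objective functions that differ in the general case of $b > 1$ bits become essentially identical in the $b = 1$ case. For example, expanding the pairwise terms in (1)–(3) for a single bit (noting that $z_n^2 = 1$ if $z_n \in \{-1,+1\}$):

$$\begin{aligned}
\text{KSH:}\quad & (z_n z_m - y_{nm})^2 = -2 y_{nm} z_n z_m + \text{constant}\\
\text{BRE:}\quad & \big((z_n - z_m)^2 - y_{nm}\big)^2 = -4 (2 - y_{nm}) z_n z_m + \text{constant}\\
\text{LAP:}\quad & y_{nm} (z_n - z_m)^2 = -2 y_{nm} z_n z_m + \text{constant}
\end{aligned}$$

So the Laplacian and KSH objectives are in fact identical, and all three can be written in the form of a binary quadratic function without linear term (or a Markov random field with quadratic potentials only):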

$$\min_{\mathbf{z}} E(\mathbf{z}) = \mathbf{z}^T \mathbf{A} \mathbf{z} \quad \text{with } \mathbf{z} \in \{-1,+1\}^N \qquad (5)$$

with an appropriate, data-dependent neighborhood symmetric matrix $\mathbf{A}$ of $N \times N$. This problem is NP-complete in general (Garey and Johnson, 1979; Boros and Hammer, 2002; Kolmogorov and Zabih, 2003), when $\mathbf{A}$ has both positive and negative elements, as well as zeros. It is submodular if $\mathbf{A}$ has only nonpositive elements, in which case it is equivalent to a min-cut/max-flow problem and it can be solved in polynomial time (Boros and Hammer, 2002; Greig et al., 1989).
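The single-bit equivalence can be checked numerically by brute force on a small problem. The sketch below assumes the single-bit KSH term $(z_n z_m - y_{nm})^2$, the single-bit Laplacian term $y_{nm}(z_n - z_m)^2$, and the choice $\mathbf{A} = -2\mathbf{Y}$ for the quadratic form. It enumerates every $\mathbf{z} \in \{-1,+1\}^N$ and records the gaps between the objectives, which should not depend on $\mathbf{z}$. The random affinity matrix is made up for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N = 8
Y = np.triu(rng.integers(-1, 2, size=(N, N)), k=1)
Y = Y + Y.T                      # symmetric affinities in {-1, 0, +1}, zero diagonal

def ksh_1bit(z):
    """Single-bit KSH objective: sum of (z_n z_m - y_nm)^2."""
    return int(np.sum((np.outer(z, z) - Y) ** 2))

def lap_1bit(z):
    """Single-bit Laplacian objective: sum of y_nm (z_n - z_m)^2."""
    return int(np.sum(Y * (z[:, None] - z[None, :]) ** 2))

def quadratic(z):
    """Binary quadratic form z^T A z with A = -2 Y."""
    return int(z @ (-2 * Y) @ z)

ksh_minus_lap = set()
lap_minus_quad = set()
for bits in itertools.product([-1, 1], repeat=N):   # all 2^N code vectors
    z = np.array(bits)
    ksh_minus_lap.add(ksh_1bit(z) - lap_1bit(z))
    lap_minus_quad.add(lap_1bit(z) - quadratic(z))
```

Each set ends up with a single element, so the three objectives differ only by additive constants and share the same minimizers on this instance.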

More generally, any objective function of a binary vector $\mathbf{z}$ that has the form $E(\mathbf{z}) = \sum_{n,m=1}^{N} f_{nm}(z_n, z_m)$ and which only depends on Hamming distances between bits $z_n$, $z_m$ can be written as $f_{nm}(z_n, z_m) = a_{nm} z_n z_m + b_{nm}$. Even more, an arbitrary function of 3 binary variables that depends only on their Hamming distances can be written as a quadratic function of the 3 variables. However, for 4 variables or more this is not generally true (see appendix A).

Computational advantages

Training the hash functions independently has some important advantages. First, training the $b$ functions can be parallelized perfectly. This is a speedup of one to two orders of magnitude for typical values of $b$ (32 to 200 in our experiments). Coupled objective functions such as KSH do not exhibit obvious parallelism, because they are trained with alternating optimization, which is inherently sequential.

Second, even in a single processor, $b$ binary optimizations over $N$ variables each are generally easier than one binary optimization over $bN$ variables. This is so because the search spaces contain $b\,2^N$ and $2^{bN}$ states, respectively, so enumeration is much faster in the independent case (even though it is still impractical). If using an approximate polynomial-time algorithm, the independent case is also faster if the runtime is superlinear in the number of variables: the asymptotic runtimes will be $\mathcal{O}(b N^{\alpha})$ and $\mathcal{O}((bN)^{\alpha})$ with $\alpha > 1$, respectively. This is the case for the best practical GraphCut (Boykov et al., 2001) and max-flow/min-cut algorithms (Cormen et al., 2009).

Third, the solution exhibits “nesting”, that is, to get the solution for $b+1$ bits we just need to take a solution with $b$ bits and add one more bit (as happens with PCA). This is unlike most methods based on a coupled objective function (such as KSH), where the solution for $b+1$ bits cannot be obtained by adding one more bit: we have to solve for $b+1$ bits from scratch.

For ILHf, both the training and test time are lower than if using all features for each hash function. The test runtime for a query is times smaller.

Model selection for the number of bits

Selecting the number of bits (hash functions) to use has not received much attention in the binary hashing literature. The most obvious way to do this would be to maximize the precision on a test set over $b$ (cross-validation) subject to $b$ not exceeding a preset limit (so applying the hash function is fast with test queries). The nesting property of ILH makes this computationally easy: we simply keep adding bits until the test precision stabilizes or decreases, or until we reach the maximum $b$. We can still benefit from parallel processing: if $P$ processors are available, we train $P$ hash functions in parallel and evaluate their precision, also in parallel. If we still need to increase $b$, we train $P$ more hash functions, etc.

4 Experiments

We use the following labeled datasets (all using the Euclidean distance in feature space): (1) CIFAR (Krizhevsky, 2009) contains images in classes. We use GIST features (Oliva and Torralba, 2001) from each image. We use images for training and for test. (2) Infinite MNIST (Loosli et al., 2007). We generated, using elastic deformations of the original MNIST handwritten digit dataset, images for training and for test, in classes. We represent each image by a vector of raw pixels. Appendix C contains experiments on additional datasets.

Because of the computational cost of affinity-based methods, previous work has used training sets limited to a few thousand points (Kulis and Darrell, 2009; Liu et al., 2012; Lin et al., 2013; Ge et al., 2014). Unless otherwise indicated, we train the hash functions in a subset of points of the training set, and report precision and recall by searching for a test query on the entire dataset (the base set). As hash functions (for each bit), we use linear SVMs (trained with LIBLINEAR; Fan et al., 2008) and kernel SVMs (with basis functions centered at a random subset of training points).

We report precision and recall for the test set queries using as ground truth (set of true neighbors in original space) all the training points with the same label as the query. The retrieved set contains the $k$ nearest neighbors of the query point in the Hamming space. We report precision for different values of $k$ to test the robustness of different algorithms.
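The evaluation protocol above can be sketched as follows, assuming $\pm 1$ codes so that the Hamming distance between two $b$-bit codes equals $(b - \mathbf{z}_q^T \mathbf{z}_d)/2$. The function name and the toy data are hypothetical, and the sketch assumes every query label occurs at least once in the database (otherwise recall is undefined).

```python
import numpy as np

def precision_recall_at_k(query_codes, db_codes, query_labels, db_labels, k):
    """Mean precision and recall when retrieving the k Hamming-nearest codes.

    Ground truth for a query: all database points with the same label.
    """
    b = db_codes.shape[1]
    hamming = (b - query_codes @ db_codes.T) // 2            # (n_queries, n_db)
    retrieved = np.argsort(hamming, axis=1, kind="stable")[:, :k]
    hits = (db_labels[retrieved] == query_labels[:, None]).sum(axis=1)
    n_true = (db_labels[None, :] == query_labels[:, None]).sum(axis=1)
    return float(np.mean(hits / k)), float(np.mean(hits / n_true))

# Toy database: four 2-bit codes with class labels; one query of class 0.
db_codes = np.array([[1, 1], [1, 1], [-1, -1], [-1, 1]])
db_labels = np.array([0, 0, 1, 1])
query_codes = np.array([[1, 1]])
query_labels = np.array([0])

p2, r2 = precision_recall_at_k(query_codes, db_codes, query_labels, db_labels, k=2)
p3, r3 = precision_recall_at_k(query_codes, db_codes, query_labels, db_labels, k=3)
```

With `k=2` both retrieved points share the query's label; with `k=3` one retrieved point has a different label, so precision drops while recall stays at its maximum.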

Diversity mechanisms with ILH

[Figure 1 plots. Rows: linear, kernel, and kernel with private centers. Columns: ILHi, ILHt, ILHf, ILHitf, KSHcut. Axis label: training set size; curves for 32, 64 and 128 bits.]

Figure 1: Diversity mechanisms vs baseline (KSHcut). Precision on the CIFAR dataset, as a function of the training set size and the number of bits (32 to 128). Ground truth: all points with the same label as the query. Retrieved set: $k$ nearest neighbors of the query. Errorbars shown only for ILHt (over 5 random training sets) to avoid clutter. Top to bottom: the hash functions are linear, kernel and kernel with private centers. Left to right: ILH diversity mechanisms and their combination, and the baseline KSHcut.

To understand the effect of diversity, we evaluate the 3 mechanisms ILHi, ILHt and ILHf, and their combination ILHitf, over a range of numbers of bits (32 to 128) and training set sizes. As baseline coupled objective, we use KSH (Liu et al., 2012) but with the same two-step training as ILH: first we find the codes using the alternating min-cut method described earlier (initialized from an all-ones code, and running one iteration of alternating min-cut) and then we fit the classifiers. This is faster and generally finds better optima than the original KSH optimization (Lin et al., 2014b). We denote it as KSHcut.

Fig. 1 shows the results. The clearly best diversity mechanism is ILHt, which works better than the other mechanisms, even when combined with them, and significantly better than KSHcut. We explain this as follows. Although all 3 mechanisms introduce diversity, ILHt has a distinct advantage (also over KSHcut): it effectively uses $b$ times as much training data, because each hash function has its own disjoint dataset. Using $bN$ training points in KSHcut would be orders of magnitude slower. ILHt is equal to or even better than the combined ILHitf because 1) since there is already enough diversity in ILHt, the extra diversity from ILHi and ILHf does not help; 2) ILHf uses less data (it discards features), which can hurt the precision; this is also seen in fig. 2 (panel 2). The precision of all methods saturates as $N$ increases; with enough bits, ILHt achieves nearly maximum precision with a relatively small per-bit training set. In fact, if we continued to increase the per-bit training set size $N$ in ILHt, eventually all bits would use the same training set (containing all available data), diversity would disappear and the precision would drop drastically to the precision of using a single bit. Practical image retrieval datasets are so large that this is unlikely to occur unless $N$ is very large (which would make the optimization too slow anyway).

Linear SVMs are very stable classifiers known to benefit less from ensembles than less stable classifiers such as decision trees or neural nets (Kuncheva, 2014). Remarkably, they strongly benefit from the ensemble in our case. This is because each hash function is solving a different classification problem (different output labels), so the resulting SVMs are in fact quite different from each other. The conclusions for kernel hash functions are similar. We tried two cases: all the hash functions using the same, common centers for the radial basis functions vs each hash function using its own centers. Nonlinear classifiers are less stable than linear ones. In our case they do not benefit much more from the diversity than linear SVMs do. They do achieve higher precision since they are more powerful models, particularly when using private centers.

Fig. 2 (panels 1–2) shows the results in ILHf of varying the number of features $d \le D$ used by each hash function. Intuitively, very low $d$ is bad because each classifier receives too little information and will produce near-random codes. Indeed, for low $d$ the precision is comparable to that of LSH (random projections) in fig. 2 (panel 4). Very high $d$ would also be expected to work badly, because it would eliminate the diversity and the precision would drop to that of a single bit for $d = D$. This does not actually happen because there is an additional source of diversity: the randomization in the alternating min-cut iterations. This has an effect similar to that of ILHi, and indeed a comparable precision. The highest precision is achieved at an intermediate proportion $d/D$ for ILHf, indicating some redundancy in the features. When combined with the other diversity mechanisms (ILHitf, panel 2), the highest precision occurs for $d = D$, because diversity is already provided by the other mechanisms, and using more data is better.

[Figure 2 plots. Panels: ILHf, ILHitf, ILHt: training set sampling, Incremental ILHt. Rows: precision on CIFAR and on Infinite MNIST. Horizontal axes: $d/D$ (panels 1–2) and number of bits (panels 3–4); curves for 32, 64 and 128 bits.]

Figure 2: Panels 1–2: effect of the proportion of features used in ILHf and ILHitf. Panel 3: bootstrap vs random vs disjoint training sets in ILHt (disjoint is not feasible for CIFAR, as it is not large enough). Panel 4: precision as a function of the number of hash functions for different methods (for ILHt and LSH we show 5 curves, each one random ordering of the bits). All results show precision using a training set of points. Errorbars over 5 random training sets. Ground truth: all points with the same label as the query. Retrieved set: nearest neighbors of the query, where for CIFAR (top) and for Infinite MNIST (bottom).

Fig. 2 (panel 3) shows the results of constructing the training sets for ILHt as a random sample from the base set such that they are “bootstrapped” (sampled with replacement), “disjoint” (sampled without replacement) or “random” (sampled without replacement but reset for each bit, so the training sets may overlap). As expected, “disjoint” (closely followed by “random”) is consistently and significantly better than “bootstrap” because it introduces more independence between the hash functions and learns from more data overall (since each hash function uses the same training set size $N$).

Precision as a function of $b$

Fig. 2 (panel 4) shows the precision (in the test set) as a function of the number of bits $b$ for ILHt, where the solution for $b+1$ bits is obtained by adding a new bit to the solution for $b$. Since the hash functions obtained depend on the order in which we add the bits, we show 5 such orders (red curves). Remarkably, the precision increases nearly monotonically and keeps increasing as more bits are added (note that the prediction error in bagging ensembles typically levels off after around 25–50 decision trees; Kuncheva, 2014, p. 186). This is (at least partly) because the effective training set size is proportional to $b$. The variance in the precision decreases as $b$ increases. In contrast, for KSHcut the variance is larger and the precision barely increases beyond a certain number of bits. The higher variance for KSHcut is due to the fact that each value of $b$ involves training from scratch and we can converge to a relatively different local optimum. As with ILHt, adding LSH random projections (again 5 curves for different orders) increases precision monotonically, but can only reach a low precision at best, since it lacks supervision. We also show the curve for thresholded PCA (tPCA), whose precision peaks and then decreases as more bits are added. A likely explanation is that high-order principal components essentially capture noise rather than signal, i.e., random variation in the data, and this produces random codes for those bits, which destroy neighborhood information. Bagging tPCA (Leng et al., 2014) does make tPCA improve monotonically with $b$, but the result is still far from competitive. The reason is that there is little diversity among the ensemble members, because the top principal components can be accurately estimated even from small samples. The result in fig. 2 uses tPCA ensembles where each member has 16 principal components, i.e., 16 bits. If using single-bit members, as with ILHt, the precision with many bits is barely better than with 1 bit.

Is the precision gap between KSH and ILHt due to an incomplete optimization of the KSH objective, or to bad local optima? We verified that 1) random perturbations of the KSHcut optimum lower the precision; 2) optimizing KSHcut using the ILHt codes as initialization (“KSHcut-ILHt” curve) increases the precision but it still remains far from that of ILHt. This confirms that the optimization algorithm is doing its job, and that the ILHt diversity mechanism is superior to coupling the hash functions in a joint objective.

Are the codes orthogonal?

The result of learning binary hashing is $b$ hash functions, represented by a matrix $\mathbf{W}$ of $b \times D$ real weights for linear SVMs, and a matrix $\mathbf{Z}$ of $N \times b$ binary ($\pm 1$) codes for the entire dataset (one row per point). We define a measure of code orthogonality as follows. Define $b \times b$ matrices $\mathbf{C}_Z = \frac{1}{N} \mathbf{Z}^T \mathbf{Z}$ for the codes and $\mathbf{C}_W = \mathbf{W} \mathbf{W}^T$ for the weights (assuming normalized SVM weights). Each $\mathbf{C}$ matrix has entries in $[-1,1]$, equal to a normalized dot product of codes or weight vectors, and diagonal entries equal to $1$. (Note that any matrix $\mathbf{S}\mathbf{C}\mathbf{S}$ where $\mathbf{S}$ is diagonal with $\pm 1$ entries is equivalent, since reversing the sign of a hash function’s output does not alter the Hamming distances.) Perfect orthogonality happens when $\mathbf{C} = \mathbf{I}$, and is encouraged (explicitly or not) by many binary hashing methods.
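A minimal sketch of this orthogonality measure, under the assumption that the code matrix is $\frac{1}{N}\mathbf{Z}^T\mathbf{Z}$ for $N \times b$ codes in $\pm 1$ and the weight matrix is $\mathbf{W}\mathbf{W}^T$ after normalizing each row of $\mathbf{W}$ to unit length. The function names and the random test data are illustrative only; random codes and weights stand in for learned ones.

```python
import numpy as np

def code_orthogonality(Z):
    """b x b matrix of normalized dot products between the bit columns of Z (N x b, +-1)."""
    Z = np.asarray(Z, dtype=float)
    return Z.T @ Z / Z.shape[0]

def weight_orthogonality(W):
    """b x b matrix of cosines between the rows of W (b x D weight vectors)."""
    W = np.asarray(W, dtype=float)
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)   # normalize each weight vector
    return Wn @ Wn.T

rng = np.random.default_rng(0)
Z = rng.choice(np.array([-1, 1]), size=(2000, 16))      # stand-in for learned codes
W = rng.standard_normal((16, 64))                       # stand-in for SVM weights

C_Z = code_orthogonality(Z)
C_W = weight_orthogonality(W)

# Off-diagonal entries are what the histograms summarize; all zeros would mean
# perfectly orthogonal codes.
off_diag = C_Z[~np.eye(16, dtype=bool)]
```

Flipping the sign of any column of `Z` only flips the sign of the corresponding row and column of `C_Z`, which is why such sign changes are treated as equivalent.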

Fig. 3 shows this for ILHt in CIFAR ( training points of dimension ) and Infinite MNIST ( training points of dimension ). It plots and as an image, as well as the histogram of the entries of and . The histograms also contain, as a control, the histogram corresponding to normalized dot products of random vectors (of dimension or , respectively), which is known to tend to a delta function at 0 as the dimension grows. Although has some tendency to orthogonality as the number of bits used increases, it is clear that, for both codes and weight vectors, the distribution of dot products is wide and far from strict orthogonality. Hence, enforcing orthogonality does not seem necessary to achieve good hash functions and codes.

[Figure 3 plots. For CIFAR and Infinite MNIST: images of the orthogonality matrices and histograms of their entries.]

Figure 3: Orthogonality of codes ($\mathbf{C}_Z$ matrix and histogram, upper plots) and of hash function weight vectors ($\mathbf{C}_W$ matrix and histogram, lower plots) in different datasets. Both matrices $\mathbf{C}_Z$ and $\mathbf{C}_W$ are of $b \times b$ where $b$ is the number of bits (i.e., the number of hash functions).

Comparison with other binary hashing methods

We compare with both the original KSH (Liu et al., 2012) and its min-cut optimization KSHcut (Lin et al., 2014b), and a representative subset of affinity-based and unsupervised hashing methods: Supervised Binary Reconstructive Embeddings (BRE) (Kulis and Darrell, 2009), Supervised Self-Taught Hashing (STH) (Zhang et al., 2010), Spectral Hashing (SH) (Weiss et al., 2009), Iterative Quantization (ITQ) (Gong et al., 2013), Binary Autoencoder (BA) (Carreira-Perpiñán and Raziperchikolaei, 2015), thresholded PCA (tPCA), and Locality-Sensitive Hashing (LSH) (Andoni and Indyk, 2008). We create affinities for all the affinity-based methods using the dataset labels. For each training point $\mathbf{x}_n$, we use as similar neighbors points with the same label as $\mathbf{x}_n$, and as dissimilar neighbors points chosen randomly among the points whose labels are different from that of $\mathbf{x}_n$. For all datasets, all the methods are trained using a subset of the training points. Given that KSHcut already performs well (Lin et al., 2014b) and that ILHt consistently outperforms it both in precision and runtime, we expect ILHt to be competitive with the state-of-the-art. Fig. 4 shows this is generally the case, particularly as the number of bits increases, when ILHt beats all other methods, which are not able to increase precision as much as ILHt does.

[Figure 4 plots. Precision as a function of $k$ and precision/recall curves, for CIFAR and Infinite MNIST.]

Figure 4: Comparison with different binary hashing methods in precision and precision/recall, using linear SVMs as hash functions, using different numbers of bits $b$, for CIFAR and Infinite MNIST. Ground truth: all points with the same label as the query. Retrieved set: $k$ nearest neighbors, for a range of $k$.

Runtime

The runtime to train a single ILHt hash function (in a single processor) for CIFAR is as follows:

Number of points
Time in seconds

This is much faster than other affinity-based hashing methods (for example, BRE did not converge after hours of training). KSHcut is among the faster methods. Its runtime per min-cut pass over a single bit is comparable to ours, but it needs $b$ sequential passes to complete just one alternating optimization iteration, while our $b$ functions can be trained in parallel.

Summary

ILHt achieves a remarkably high precision compared to a coupled KSH objective using the same optimization algorithm but introducing diversity by feeding different data to independent hash functions rather than by jointly optimizing over them. It also compares well with state-of-the-art methods in precision/recall, being competitive if few bits are used and the clear winner as more bits are used, and is very fast and embarrassingly parallel.

5 Discussion

We have revealed for the first time a connection between supervised binary hashing and ensemble learning that could open the door to many new hashing algorithms. Although we have focused on a specific objective (Laplacian) and identified as particularly successful with it a specific diversity mechanism (disjoint training sets), other choices may be better depending on the application. The core idea we propose is the independent training of the hash functions via the introduction of diversity by means other than coupling terms in the objective or constraints. This may come as a surprise in the area of learning binary hashing, where most work in the last few years has focused on proposing complex objective functions that couple all hash functions and developing sophisticated optimization algorithms for them.

Another surprise is that orthogonality of the codes or hash functions seems unnecessary. ILHt creates codes and hash functions that do differ from each other but are far from being orthogonal, yet they achieve good precision that keeps growing as we add bits. Thus, introducing diversity through different training data seems a better mechanism to make hash functions differ than coupling the codes through an orthogonality constraint or otherwise. It is also far simpler and faster to train independent single-bit hash functions.
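The training scheme discussed above (one hash function per disjoint subset of the data, each trained independently) can be sketched as follows. This is only an illustration of the data flow under stated assumptions: `disjoint_subsets`, `train_one_bit` and `train_ensemble` are hypothetical names, and `train_one_bit` is a placeholder (a random hyperplane thresholded at the median projection), not the ILHt optimization.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def disjoint_subsets(n_points, n_bits, subset_size, seed=0):
    """Split point indices into n_bits non-overlapping subsets of equal size."""
    if n_bits * subset_size > n_points:
        raise ValueError("not enough points for disjoint subsets")
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    return [indices[i * subset_size:(i + 1) * subset_size] for i in range(n_bits)]


def train_one_bit(points, seed):
    """Placeholder single-bit trainer (assumption; NOT the ILHt optimization).

    Picks a random direction and thresholds at the median projection.
    """
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(len(points[0]))]
    projections = sorted(sum(wi * xi for wi, xi in zip(w, x)) for x in points)
    threshold = projections[len(projections) // 2]
    return lambda x: int(sum(wi * xi for wi, xi in zip(w, x)) > threshold)


def train_ensemble(data, n_bits, subset_size, seed=0):
    """Train one hash function per disjoint subset; no function sees another's data."""
    subsets = disjoint_subsets(len(data), n_bits, subset_size, seed)
    jobs = [([data[i] for i in idx], seed + k) for k, idx in enumerate(subsets)]
    with ThreadPoolExecutor() as pool:
        functions = list(pool.map(lambda job: train_one_bit(*job), jobs))
    return subsets, functions


rng = random.Random(1)
data = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
subsets, hash_functions = train_ensemble(data, n_bits=8, subset_size=25)
code = [h(data[0]) for h in hash_functions]
print(code)
```

Because the jobs share no state, the thread pool could be swapped for a process pool or separate machines when the per-bit trainer is CPU-bound; the threads here only show that no job depends on another.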

A final surprise is that the wide variety of affinity-based objective functions in the $b$-bit case reduces to a binary quadratic problem in the 1-bit case regardless of the form of the $b$-bit objective (as long as it depends on Hamming distances only). In this sense, there is a unique objective in the 1-bit case.

There has been a prior attempt to use bagging (bootstrapped samples) with truncated PCA (Leng et al., 2014). Our experiments show that, while this improves truncated PCA, it performs poorly in supervised binary hashing. This is because PCA is unsupervised and does not use the user-provided similarity information, which may disagree with Euclidean distances in image space; and because estimating principal components from samples has low diversity. Also, PCA is computationally simple and there is little gain by bagging it, unlike the far more difficult optimization of supervised binary hashing.

Some supervised binary hashing work (Liu et al., 2012; Wang et al., 2012) has proposed to learn the hash functions sequentially, where the $i$th function has an orthogonality-like constraint to force it to differ from the previous functions. Hence, this does not learn the functions independently and can be seen as a greedy optimization of a joint objective over all functions.

Binary hashing does differ from ensemble learning in one important point: the predictions of the classifiers (= hash functions) are not combined into a single prediction, but are instead concatenated into a binary vector (which can take $2^b$ possible values). The “labels” (the binary codes) for the “classifiers” (the hash functions) are unknown, and are implicitly or explicitly learned together with the hash functions themselves. This means that well-known error decompositions such as the error-ambiguity decomposition (Krogh and Vedelsby, 1995) and the bias-variance decomposition (Geman et al., 1992) do not apply. Also, the real goal of binary hashing is to do well in information retrieval measures such as precision and recall, but hash functions do not directly optimize this. A theoretical understanding of why diversity helps in learning binary hashing is an important topic of future work.
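To make the contrast with voting concrete, here is a minimal sketch of how single-bit outputs are concatenated into a code and used for retrieval by Hamming distance. The three toy hash functions and the helper names (`encode`, `hamming`, `retrieve`) are assumptions made for illustration only.

```python
def encode(x, hash_functions):
    """Concatenate the single-bit outputs into one integer code (no voting)."""
    code = 0
    for h in hash_functions:
        code = (code << 1) | h(x)
    return code


def hamming(a, b):
    """Hamming distance between two integer codes: popcount of their XOR."""
    return bin(a ^ b).count("1")


def retrieve(query_code, database_codes, k):
    """Indices of the k database codes closest to the query in Hamming distance."""
    order = sorted(range(len(database_codes)),
                   key=lambda n: hamming(query_code, database_codes[n]))
    return order[:k]


# Three toy hash functions (hypothetical): thresholds on simple linear projections.
hash_functions = [
    lambda x: int(x[0] > 0),
    lambda x: int(x[1] > 0),
    lambda x: int(x[0] + x[1] > 1),
]
database = [(-1.0, -1.0), (2.0, 2.0), (1.0, -1.0), (0.2, 0.3)]
codes = [encode(x, hash_functions) for x in database]
query = encode((0.5, 0.1), hash_functions)
print(codes, query, retrieve(query, codes, k=2))
```

With 3 bits the codes live in a space of $2^3 = 8$ values; two database points that receive the same code become indistinguishable to the search.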

In this respect, there is also a relation with error-correcting output codes (ECOC) (Dietterich and Bakiri, 1995), an approach for multiclass classification. In ECOC, we represent each of the classes with a $b$-bit binary vector, ensuring that $b$ is large enough for the vectors to be sufficiently separated in Hamming distance. Each bit corresponds to partitioning the classes into two groups. We then train $b$ binary classifiers, such as decision trees. Given a test pattern, we output as class label the one closest in Hamming distance to the $b$-bit output of the $b$ classifiers. The redundant error-correcting codes allow for small errors in the individual classifiers and can improve performance. An ECOC can also be seen as an ensemble of classifiers where we manipulate the output targets (rather than the input features or training set) to obtain each classifier, and we apply majority vote on the final result (if the test output in classifier $i$ is 1, then all classes associated with 1 get a vote). The main benefit of ECOC seems to be in variance reduction, as in other ensemble methods (James and Hastie, 1998). Binary hashing can be seen as an ECOC with $N$ classes, one per training point, with the ECOC prediction for a test pattern (query) being the nearest-neighbor class codes in Hamming distance. However, unlike in ECOC, in binary hashing the codes are learned so that they preserve neighborhood relations between training points. Also, while ideally all codes should be different (since a collision makes two originally different patterns indistinguishable, which will degrade some searches), this is not guaranteed in binary hashing.
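The ECOC decoding rule described above can be illustrated with a small sketch. The 4-class, 6-bit codebook below is a made-up example (not taken from any of the cited papers), chosen so that any two codewords differ in at least 3 bits and a single classifier error can therefore be corrected.

```python
def hamming(u, v):
    """Number of positions where two bit tuples differ."""
    return sum(a != b for a, b in zip(u, v))


def min_pairwise_distance(codebook):
    """Smallest Hamming distance between any two class codewords."""
    words = list(codebook.values())
    return min(hamming(words[i], words[j])
               for i in range(len(words)) for j in range(i + 1, len(words)))


def ecoc_decode(outputs, codebook):
    """Return the class whose codeword is closest to the classifier outputs."""
    return min(codebook, key=lambda label: hamming(outputs, codebook[label]))


# Hypothetical codebook: one 6-bit codeword per class.
codebook = {
    "cat":  (0, 0, 0, 0, 0, 0),
    "dog":  (1, 1, 1, 0, 0, 0),
    "bird": (0, 0, 1, 1, 1, 1),
    "fish": (1, 1, 0, 1, 1, 0),
}

# Outputs of the 6 binary classifiers for a "dog" pattern, with bit 2 wrong.
noisy_outputs = (1, 0, 1, 0, 0, 0)
print(min_pairwise_distance(codebook), ecoc_decode(noisy_outputs, codebook))
```

In binary hashing the analogue of the codebook is the set of learned codes of the training points, and nothing forces those codes to be well separated or even distinct.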

A final, different example shows the important role of diversity, i.e., making the hash functions differ, in learning good hash functions. Some binary hashing methods optimize an objective essentially of the following form (Rastegari et al., 2015; Xia et al., 2015):

$$\min_{\mathbf{W}} \; E(\mathbf{W}) = \sum_{n=1}^{N} \left\| \mathbf{W}\mathbf{x}_n - \operatorname{sgn}(\mathbf{W}\mathbf{x}_n) \right\|^2 \quad \text{s.t.} \quad \mathbf{W}\mathbf{W}^T = \mathbf{I} \qquad (6)$$

where $\mathbf{W}$ is a linear projection matrix of $b \times D$. The idea is to force the projections to be as close as possible to binary values. The orthogonality constraint ensures that trivial solutions (which would make all hash functions equal) are not optimal. Remarkably, the objective function (6) contains no explicit information about neighborhood preservation (as in affinity-based loss functions) or reconstruction of the input (as in autoencoders). Although orthogonal projections preserve Euclidean distances, this is no longer true if we keep only a few, binarized projections. Yet this can produce good hash functions if initialized from PCA or ITQ, which did learn projections that try to reconstruct the inputs optimally, and a local optimum of the (NP-complete) objective (6) may not be far from that. Thus, it would appear that part of the success of these approaches relies on the constraint providing a form of diversity among the hash functions.

6 Conclusion

Much work in supervised binary hashing has focused on designing sophisticated objective functions of the hash functions that force them to compete with each other while trying to preserve neighborhood information. We have shown, surprisingly, that training hash functions independently is not just simpler, faster and parallel, but also can achieve better retrieval quality, as long as diversity is introduced into each hash function’s objective function. This establishes a connection with ensemble learning and allows one to borrow techniques from it. We showed that having each hash function optimize a Laplacian objective on a disjoint subset of the data works well, and facilitates selecting the number of bits to use. Although our evidence is mostly empirical, the intuition behind it is sound and in agreement with the many results (also mostly empirical) showing the power of ensemble classifiers. The ensemble learning perspective suggests many ideas for future work, such as pruning a large ensemble or using other diversity techniques. It may also be possible to characterize theoretically the performance in precision of binary hashing depending on the diversity of the hash functions.

Appendix A Equivalence of objective functions in the single-bit case: proofs

In the main paper, we state that, in the single-bit case ($b = 1$), the Laplacian, KSH and BRE loss functions over the vector $\mathbf{z} \in \{-1,+1\}^N$ of binary codes for each data point can be written in the form of a binary quadratic function without linear term (or a MRF with quadratic potentials only):

$$\min_{\mathbf{z}} \; E(\mathbf{z}) = \mathbf{z}^T \mathbf{A} \mathbf{z} \quad \text{with} \quad \mathbf{z} \in \{-1,+1\}^N \qquad (7)$$

with an appropriate, data-dependent neighborhood symmetric matrix $\mathbf{A}$ of $N \times N$. We can assume w.l.o.g. that $a_{nn} = 0$, i.e., the diagonal elements of $\mathbf{A}$ are zero, since any diagonal values simply add a constant to $E$.
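The claim can be checked numerically for small $N$. The sketch below assumes the commonly written single-bit forms of two of the losses, KSH as $\sum_{n \neq m} (z_n z_m - y_{nm})^2$ and Laplacian as $\sum_{n \neq m} y_{nm} (z_n - z_m)^2$, with a random symmetric affinity matrix `Y`; these forms and all names are assumptions for illustration. Both losses should equal $\mathbf{z}^T \mathbf{A} \mathbf{z}$ with $\mathbf{A} = -2\mathbf{Y}$ up to a constant that does not depend on $\mathbf{z}$.

```python
import itertools
import random

N = 5
rng = random.Random(0)

# Random symmetric affinities with zero diagonal (+1 similar, -1 dissimilar, 0 unknown).
Y = [[0] * N for _ in range(N)]
for n in range(N):
    for m in range(n + 1, N):
        Y[n][m] = Y[m][n] = rng.choice([-1, 0, 1])

# Candidate matrix of the binary quadratic form (zero diagonal because Y has one).
A = [[-2 * Y[n][m] for m in range(N)] for n in range(N)]


def quadratic(z):
    """z^T A z for a vector z in {-1,+1}^N."""
    return sum(A[n][m] * z[n] * z[m] for n in range(N) for m in range(N))


def ksh_loss(z):
    """Single-bit KSH-style loss (assumed form)."""
    return sum((z[n] * z[m] - Y[n][m]) ** 2 for n in range(N) for m in range(N) if n != m)


def laplacian_loss(z):
    """Single-bit Laplacian-style loss (assumed form)."""
    return sum(Y[n][m] * (z[n] - z[m]) ** 2 for n in range(N) for m in range(N) if n != m)


all_codes = list(itertools.product((-1, 1), repeat=N))
ksh_offsets = {ksh_loss(z) - quadratic(z) for z in all_codes}
laplacian_offsets = {laplacian_loss(z) - quadratic(z) for z in all_codes}
print(ksh_offsets, laplacian_offsets)
```

Each set contains a single value, so on this example both losses differ from the same quadratic form only by an additive constant and share the same minimizers.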

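More generally, consider an arbitrary objective function of a binary vector $\mathbf{z} \in \{-1,+1\}^N$ that has the form $E(\mathbf{z}) = \sum_{n,m=1}^{N} f_{nm}(z_n, z_m)$ and which only depends on Hamming distances between bits $z_n$, $z_m$. This is the form of the affinity-based loss function used in many binary hashing papers, in the single-bit case. Each term of the function can be written as $f_{nm}(z_n, z_m) = a_{nm} z_n z_m + b_{nm}$. This fact, already noted by Lin et al. (2013), is because a function of 2 binary variables can take 4 different values:

$$f_{nm}(1,1), \quad f_{nm}(1,-1), \quad f_{nm}(-1,1), \quad f_{nm}(-1,-1)$$

but if $f_{nm}$ only depends on the Hamming distance of $z_n$ and $z_m$ then we have $f_{nm}(1,1) = f_{nm}(-1,-1)$ and $f_{nm}(1,-1) = f_{nm}(-1,1)$. This can be achieved by $a_{nm} = \frac{1}{2}\left(f_{nm}(1,1) - f_{nm}(1,-1)\right)$ and $b_{nm} = \frac{1}{2}\left(f_{nm}(1,1) + f_{nm}(1,-1)\right)$, and the constant $b_{nm}$ can be ignored when optimizing.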

By a similar argument we can prove that an arbitrary function of 3 binary variables that depends only on their Hamming distances can be written as a quadratic function of the 3 variables.

However, this is not true in general. This can be seen by comparing the dimensions of the function spaces spanned by the arbitrary function and the quadratic function. Consider first a general quadratic function $E(\mathbf{z}) = \mathbf{z}^T \mathbf{A} \mathbf{z} + c$ of $N$ binary variables $\mathbf{z} \in \{-1,+1\}^N$. We can always take $\mathbf{A}$ symmetric (because $\mathbf{z}^T \mathbf{A} \mathbf{z} = \mathbf{z}^T \left(\frac{\mathbf{A} + \mathbf{A}^T}{2}\right) \mathbf{z}$) and absorb its diagonal terms into the constant (because $z_n^2 = 1$), so we can write w.l.o.g. $E(\mathbf{z}) = 2 \sum_{n<m} a_{nm} z_n z_m + c$. This has $\frac{N(N-1)}{2} + 1$ free parameters. The vector of $2^N$ possible values of $E(\mathbf{z})$ for all possible binary vectors $\mathbf{z}$ is a linear function of these free parameters. Hence, the dimension of the space of all quadratic functions is at most $\frac{N(N-1)}{2} + 1$.

Consider now an arbitrary function of $N$ binary variables that depends only on their Hamming distances. Although there are $\frac{N(N-1)}{2}$ Hamming distances $d(z_n, z_m)$, they are all determined just by the first $N-1$ distances $d(z_1, z_n)$ for $n > 1$. This is because, given $z_1$, the distance $d(z_1, z_n)$ determines $z_n$ for each $n > 1$ and so the entire vector $\mathbf{z}$ and all the other distances. Also, given the distances $d(z_1, z_n)$ for $n > 1$, the value $z_1 = +1$ produces a vector $\mathbf{z}$ whose bits are reversed from that produced by $z_1 = -1$, so both have the same Hamming distances. Hence, we have $N-1$ free binary variables (the values of $d(z_1, z_n)$ for $n > 1$), which determine the vector of $2^N$ possible values of $E(\mathbf{z})$ for all possible binary vectors $\mathbf{z}$. Hence, the dimension of the space of all arbitrary functions of Hamming distances is $2^{N-1}$. Since $2^{N-1} > \frac{N(N-1)}{2} + 1$ for $N > 3$, the quadratic functions in general cannot represent all arbitrary binary functions of the Hamming distances using the same binary variables.
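The dimension comparison can be checked by brute force for small $N$. The sketch below makes one assumption explicit: the quadratic family is taken without linear term, i.e. the span of the constant function and the products $z_n z_m$ ($n < m$) on $\{-1,+1\}^N$. The helper names are hypothetical. It computes the exact rank of that family and compares it with $2^{N-1}$, the dimension of the functions that depend only on Hamming distances.

```python
from fractions import Fraction
from itertools import combinations, product


def rank(rows):
    """Rank of a matrix by exact Gauss-Jordan elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def quadratic_dimension(N):
    """Dimension of span{1, z_n z_m : n < m} as functions on {-1,+1}^N."""
    pairs = list(combinations(range(N), 2))
    rows = [[1] + [z[n] * z[m] for n, m in pairs]
            for z in product((-1, 1), repeat=N)]
    return rank(rows)


def hamming_function_dimension(N):
    """Dimension of the functions that depend only on Hamming distances."""
    return 2 ** (N - 1)


for N in range(2, 7):
    print(N, quadratic_dimension(N), hamming_function_dimension(N))
```

Under this assumption the two dimensions coincide up to $N = 3$ and the quadratic family falls behind from $N = 4$ on, which is where a quadratic can no longer represent every function of the Hamming distances.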

Finally, note that some objective functions which make sense in the $b$-bit case with $b > 1$ become trivial in the single-bit case. For example, the loss function for Minimal Loss Hashing (Norouzi and Fleet, 2011):

uses a hinge loss to implement the goal that similar points (having ) should differ by no more than bits and dissimilar points (having ) should differ by bits or more, where , , and is the Hamming distance between and . It is easy to see that in the single-bit case the loss becomes constant, independent of the codes—because using one bit the Hamming distance can be either 0 or 1 only.

Appendix B Orthogonality measure: proofs

In the paragraph “Are the codes orthogonal?” of the main paper, we define a measure of orthogonality for either the binary codes $\mathbf{Z}$ or the hash function weight vectors $\mathbf{W}$, based on the corresponding matrices of normalized dot products (for the weight vectors, $\mathbf{W}\mathbf{W}^T$, where the rows of $\mathbf{W}$ are normalized). Here we prove several statements we make in that paragraph.

Invariance to sign reversals

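Given a matrix $\mathbf{C}$ of $b \times b$ (the matrix of normalized dot products of either the binary codes or the weight vectors) with entries in $[-1,1]$, define as measure of orthogonality (where $\|\cdot\|_F$ is the Frobenius norm):

$$\frac{1}{b(b-1)} \left\| \mathbf{C} - \mathbf{I} \right\|_F^2 \qquad (8)$$

That is, the measure (8) is the average of the squared off-diagonal elements of $\mathbf{C}$ (its diagonal elements equal 1, being normalized dot products of a vector with itself).

Theorem B.1.

The measure (8) is independent of sign reversals of the hash functions.

Proof.

Let $\mathbf{S}$ be a diagonal $b \times b$ matrix with diagonal entries $\pm 1$. Reversing the sign of some hash functions replaces $\mathbf{C}$ with $\mathbf{S}\mathbf{C}\mathbf{S}$. $\mathbf{S}$ satisfies $\mathbf{S}^T \mathbf{S} = \mathbf{I}$ so it is orthogonal. Hence, $\|\mathbf{S}\mathbf{C}\mathbf{S} - \mathbf{I}\|_F = \|\mathbf{S}(\mathbf{C} - \mathbf{I})\mathbf{S}\|_F = \|\mathbf{C} - \mathbf{I}\|_F$. ∎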
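A small numerical check of this invariance, assuming the measure is the average of the squared off-diagonal entries of the matrix of dot products between unit-normalized weight vectors. The function names and the toy weight matrix are made up for illustration.

```python
import math


def normalized_gram(vectors):
    """Matrix of dot products between the rows after scaling each to unit length."""
    unit = []
    for row in vectors:
        norm = math.sqrt(sum(v * v for v in row))
        unit.append([v / norm for v in row])
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def orthogonality_measure(C):
    """Average of the squared off-diagonal entries of a square matrix."""
    b = len(C)
    total = sum(C[i][j] ** 2 for i in range(b) for j in range(b) if i != j)
    return total / (b * (b - 1))


def flip_signs(vectors, signs):
    """Reverse the sign of selected weight vectors (signs are +1 or -1)."""
    return [[s * v for v in row] for row, s in zip(vectors, signs)]


# Toy weight vectors of three hash functions (hypothetical values).
W = [[1.0, 2.0, 0.0], [0.5, -1.0, 2.0], [3.0, 0.0, 1.0]]
before = orthogonality_measure(normalized_gram(W))
after = orthogonality_measure(normalized_gram(flip_signs(W, [1, -1, -1])))
print(before, after)
```

Flipping a hash function only changes the sign of the corresponding row and column of the matrix, so the squared entries, and hence the measure, are unchanged.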

Distribution of the dot products of random vectors

As control hypothesis for the orthogonality of the binary codes or hash function vectors we used the distribution of dot products of random vectors. Here we give their mean and variance explicitly as a function of their dimension.

Theorem B.2.

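Let $\mathbf{x}, \mathbf{y} \in \{-1,+1\}^d$ be two random binary vectors of independent components, where $x_i$, $y_i$ take the value $+1$ with probability $\frac{1}{2}$. Let $z = \frac{1}{d} \mathbf{x}^T \mathbf{y}$. Then $\mathrm{E}\{z\} = 0$ and $\mathrm{var}\{z\} = \frac{1}{d}$.

Proof.

Let $z_i = x_i y_i$. Clearly, $z_i$ takes the value $+1$ with probability $\frac{1}{2}$, so its mean is $0$ and its variance is $1$, and $z_1, \dots, z_d$ are iid. Hence, using standard properties of the expectation and variance, we have that $\mathrm{E}\{z\} = \frac{1}{d} \sum_{i=1}^{d} \mathrm{E}\{z_i\} = 0$, and $\mathrm{var}\{z\} = \frac{1}{d^2} \sum_{i=1}^{d} \mathrm{var}\{z_i\} = \frac{1}{d}$. (Furthermore, $\frac{1}{2}(z_i + 1)$ is Bernoulli and $\frac{d}{2}(z + 1)$ is binomial.) ∎

It is also possible to prove that, for random unit vectors of dimension $d$ with real components, their dot product has mean $0$ and variance $\frac{1}{d}$.

Hence, as the dimension increases, the variance decreases, and the distribution of $z$ tends to a delta at $0$. This means that random high-dimensional vectors are practically orthogonal. The “random” histograms (black line) in fig. 3 are based on a sample of random vectors (for the weight vectors, we sample the components of each vector uniformly at random and then normalize the vector). They follow the theoretical distribution well.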
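The mean and variance in Theorem B.2 can be checked with a short Monte Carlo simulation. The sketch assumes components that are $\pm 1$ with probability $\frac{1}{2}$ and a dot product normalized by the dimension, here called `d`; the function names, the sample size and the seed are arbitrary choices.

```python
import random


def normalized_dot_samples(d, n_samples, seed=0):
    """Sample the normalized dot product of two random vectors in {-1,+1}^d.

    Each vector is stored as a d-bit integer (bit 1 = +1, bit 0 = -1), so the
    dot product equals d minus twice the number of positions where they differ.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        x = rng.getrandbits(d)
        y = rng.getrandbits(d)
        disagreements = bin(x ^ y).count("1")
        samples.append((d - 2 * disagreements) / d)
    return samples


def mean_and_variance(samples):
    """Sample mean and (biased) sample variance."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, variance


d = 64
samples = normalized_dot_samples(d, n_samples=20000)
mean, variance = mean_and_variance(samples)
print(mean, variance, 1 / d)
```

Repeating the experiment with larger `d` shows the empirical variance shrinking like $\frac{1}{d}$, which is the sense in which random high-dimensional vectors are nearly orthogonal.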

Appendix C Additional experiments

In fig. 5 we also include results for an additional, unsupervised dataset, the Flickr 1 million image dataset (Huiskes et al., 2010). For Flickr, we randomly select images for test and the rest for training. We use MPEG-7 edge histogram features. Since no labels are available, we create pseudolabels for by declaring as similar points its true nearest neighbors (using the Euclidean distance) and as dissimilar points a random subset of points among the remaining points. As ground truth, we use the nearest neighbors of the query in Euclidean space. All hash functions are trained using points. Retrieved set: nearest neighbors of the query point in the Hamming space, for a range of .

The only important difference is that Locality-Sensitive Hashing (LSH) achieves a high precision in the Flickr dataset, considerably higher than that of KSHcut. This is understandable, for the following reasons: 1) Flickr is an unsupervised dataset, and the neighborhood information provided to KSHcut (and ILHt) in the form of affinities is limited to the small subset of positive and negative neighbors , while LSH has access to the full feature vector of every image. 2) The dimensionality of the Flickr feature vectors is quite small: . Still, ILHt beats LSH by a significant margin.

In addition to the methods we used in the supervised datasets, we compare ILHt with Spectral Hashing (SH) (Weiss et al., 2009), Iterative Quantization (ITQ) (Gong et al., 2013), Binary Autoencoder (BA) (Carreira-Perpiñán and Raziperchikolaei, 2015), thresholded PCA (tPCA), and Locality-Sensitive Hashing (LSH) (Andoni and Indyk, 2008). Again, ILHt beats all other state-of-the-art methods, or is comparable to the best of them, particularly as the number of bits increases.

[Figure 5 plots for Flickr: precision vs. number of bits (ILHf, ILHitf, ILHt: training set sampling; incremental ILHt; curves for 32, 64 and 128 bits); histograms of the entries of the dot-product matrices; precision and precision/recall curves.]

Figure 5: Results for the Flickr dataset (unsupervised). The top, middle and bottom panels correspond to figures 2, 3 and 4 in the main paper. Ground truth: the first nearest neighbors of the query in the original space. Retrieved set: nearest neighbors of the query.

Acknowledgments

Work supported by NSF award IIS–1423515.

References

  • Andoni and Indyk (2008) A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Comm. ACM, 51(1):117–122, Jan. 2008.
  • Belkin and Niyogi (2003) M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, June 2003.
  • Boros and Hammer (2002) E. Boros and P. L. Hammer. Pseudo-boolean optimization. Discrete Applied Math., 123(1–3):155–225, Nov. 15 2002.
  • Boykov et al. (2001) Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(11):1222–1239, Nov. 2001.
  • Breiman (2001) L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct. 2001.
  • Breiman (1996) L. J. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, Aug. 1996.
  • Carreira-Perpiñán (2010) M. Á. Carreira-Perpiñán. The elastic embedding algorithm for dimensionality reduction. In J. Fürnkranz and T. Joachims, editors, Proc. of the 27th Int. Conf. Machine Learning (ICML 2010), pages 167–174, Haifa, Israel, June 21–25 2010.
  • Carreira-Perpiñán and Raziperchikolaei (2015) M. Á. Carreira-Perpiñán and R. Raziperchikolaei. Hashing with binary autoencoders. In Proc. of the 2015 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’15), pages 557–566, Boston, MA, June 7–12 2015.
  • Carreira-Perpiñán and Wang (2014) M. Á. Carreira-Perpiñán and W. Wang. Distributed optimization of deeply nested systems. In S. Kaski and J. Corander, editors, Proc. of the 17th Int. Conf. Artificial Intelligence and Statistics (AISTATS 2014), pages 10–19, Reykjavik, Iceland, Apr. 22–25 2014.
  • Cormen et al. (2009) T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, Cambridge, MA, third edition, 2009.
  • Dietterich (2000) T. G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, pages 1–15. Springer-Verlag, 2000.
  • Dietterich and Bakiri (1995) T. G. Dietterich and G. Bakiri. Solving multi-class learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:253–286, 1995.
  • Fan et al. (2008) R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. J. Machine Learning Research, 9:1871–1874, Aug. 2008.
  • Garey and Johnson (1979) M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, 1979.
  • Ge et al. (2014) T. Ge, K. He, and J. Sun. Graph cuts for supervised binary coding. In Proc. 13th European Conf. Computer Vision (ECCV’14), pages 250–264, Zürich, Switzerland, Sept. 6–12 2014.
  • Geman et al. (1992) S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, Jan. 1992.
  • Gong et al. (2013) Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin. Iterative quantization: A Procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Analysis and Machine Intelligence, 35(12):2916–2929, Dec. 2013.
  • Grauman and Fergus (2013) K. Grauman and R. Fergus. Learning binary hash codes for large-scale image search. In R. Cipolla, S. Battiato, and G. Farinella, editors, Machine Learning for Computer Vision, pages 49–87. Springer-Verlag, 2013.
  • Greig et al. (1989) D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, B, 51(2):271–279, 1989.
  • Huiskes et al. (2010) M. J. Huiskes, B. Thomee, and M. S. Lew. New trends and ideas in visual concept detection: The MIR Flickr Retrieval Evaluation Initiative. In Proc. ACM Int. Conf. Multimedia Information Retrieval, pages 527–536, New York, NY, USA, 2010.
  • James and Hastie (1998) G. James and T. Hastie. The error coding method and PICTs. Journal of Computational and Graphical Statistics, 7(3):377–387, Sept. 1998.
  • Kolmogorov and Zabih (2003) V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Trans. Pattern Analysis and Machine Intelligence, 26(2):147–159, Feb. 2003.
  • Krizhevsky (2009) A. Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, Dept. of Computer Science, University of Toronto, Apr. 8 2009.
  • Krogh and Vedelsby (1995) A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learning. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems (NIPS), volume 7, pages 231–238. MIT Press, Cambridge, MA, 1995.
  • Kulis and Darrell (2009) B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems (NIPS), volume 22, pages 1042–1050. MIT Press, Cambridge, MA, 2009.
  • Kuncheva (2014) L. I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, second edition, 2014.
  • Leng et al. (2014) C. Leng, J. Cheng, T. Yuan, X. Bai, and H. Lu. Learning binary codes with bagging PCA. In T. Calders, F. Esposito, E. Hüllermeier, and R. Meo, editors, Proc. of the 25th European Conf. Machine Learning (ECML–14), pages 177–192, Nancy, France, Sept. 15–19 2014.
  • Lin et al. (2014a) B. Lin, J. Yang, X. He, and J. Ye. Geodesic distance function learning via heat flows on vector fields. In E. P. Xing and T. Jebara, editors, Proc. of the 31st Int. Conf. Machine Learning (ICML 2014), pages 145–153, Beijing, China, June 21–26 2014a.
  • Lin et al. (2013) G. Lin, C. Shen, D. Suter, and A. van den Hengel. A general two-step approach to learning-based hashing. In Proc. 14th Int. Conf. Computer Vision (ICCV’13), pages 2552–2559, Sydney, Australia, Dec. 1–8 2013.
  • Lin et al. (2014b) G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter. Fast supervised hashing with decision trees for high-dimensional data. In Proc. of the 2014 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’14), pages 1971–1978, Columbus, OH, June 23–28 2014b.
  • Liu et al. (2011) W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In L. Getoor and T. Scheffer, editors, Proc. of the 28th Int. Conf. Machine Learning (ICML 2011), pages 1–8, Bellevue, WA, June 28 – July 2 2011.
  • Liu et al. (2012) W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In Proc. of the 2012 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’12), pages 2074–2081, Providence, RI, June 16–21 2012.
  • Loosli et al. (2007) G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines using selective sampling. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines, Neural Information Processing Series, pages 301–320. MIT Press, 2007.
  • Norouzi and Fleet (2011) M. Norouzi and D. Fleet. Minimal loss hashing for compact binary codes. In L. Getoor and T. Scheffer, editors, Proc. of the 28th Int. Conf. Machine Learning (ICML 2011), Bellevue, WA, June 28 – July 2 2011.
  • Oliva and Torralba (2001) A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Computer Vision, 42(3):145–175, May 2001.
  • Rastegari et al. (2015) M. Rastegari, C. Keskin, P. Kohli, and S. Izadi. Computationally bounded retrieval. In Proc. of the 2015 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’15), Boston, MA, June 7–12 2015.
  • Raziperchikolaei and Carreira-Perpiñán (2015) R. Raziperchikolaei and M. Á. Carreira-Perpiñán. Learning hashing with affinity-based loss functions using auxiliary coordinates. arXiv:1501.05352 [cs.LG], Jan. 21 2015.
  • Roweis and Saul (2000) S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, Dec. 22 2000.
  • Shakhnarovich et al. (2006) G. Shakhnarovich, P. Indyk, and T. Darrell, editors. Nearest-Neighbor Methods in Learning and Vision. Neural Information Processing Series. MIT Press, Cambridge, MA, 2006.
  • van der Maaten and Hinton (2008) L. J. P. van der Maaten and G. E. Hinton. Visualizing data using t-SNE. J. Machine Learning Research, 9:2579–2605, Nov. 2008.
  • Wang et al. (2012) J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large scale search. IEEE Trans. Pattern Analysis and Machine Intelligence, 34(12):2393–2406, Dec. 2012.
  • Weiss et al. (2009) Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In D. Koller, Y. Bengio, D. Schuurmans, L. Bottou, and A. Culotta, editors, Advances in Neural Information Processing Systems (NIPS), volume 21, pages 1753–1760. MIT Press, Cambridge, MA, 2009.
  • Xia et al. (2015) Y. Xia, K. He, P. Kohli, and J. Sun. Sparse projections for high-dimensional binary codes. In Proc. of the 2015 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’15), Boston, MA, June 7–12 2015.
  • Yu and Shi (2003) S. X. Yu and J. Shi. Multiclass spectral clustering. In Proc. 9th Int. Conf. Computer Vision (ICCV’03), pages 313–319, Nice, France, Oct. 14–17 2003.
  • Zhang et al. (2010) D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for fast similarity search. In Proc. of the 33rd ACM Conf. Research and Development in Information Retrieval (SIGIR 2010), pages 18–25, Geneva, Switzerland, July 19–23 2010.
  • Zhou (2012) Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC Machine Learning and Pattern Recognition Series. CRC Publishers, 2012.