Learning to Hash
An attractive approach for fast search in image databases is binary hashing, where each high-dimensional, real-valued image is mapped onto a low-dimensional, binary vector and the search is done in this binary space. Finding the optimal hash function is difficult because it involves binary constraints, and most approaches approximate the optimization by relaxing the constraints and then binarizing the result. Here, we focus on the binary autoencoder model, which seeks to reconstruct an image from the binary code produced by the hash function. We show that the optimization can be simplified with the method of auxiliary coordinates. This reformulates the optimization as alternating two easier steps: one that learns the encoder and decoder separately, and one that optimizes the code for each image. Image retrieval experiments, using precision/recall and a measure of code utilization, show the resulting hash function outperforms or is competitive with state-of-the-art methods for binary hashing.
We consider the problem of binary hashing, where given a high-dimensional vector x ∈ R^D, we want to map it to an L-bit vector z = h(x) ∈ {0,1}^L using a hash function h, while preserving the neighbors of x in the binary space. Binary hashing has emerged in recent years as an effective technique for fast search on image (and other) databases. While the search in the original space would cost O(ND) in both time and space, using floating-point operations, the search in the binary space costs O(NL), where L < D and the constant factor is much smaller. This is because the hardware can compute binary operations very efficiently and the entire dataset (NL bits) can fit in the main memory of a workstation. And while the search in the binary space will produce some false positives and negatives, one can retrieve a larger set of neighbors and then verify these with the ground-truth distance, while still being efficient.
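As an illustration of why the binary search is cheap, here is a minimal numpy sketch of Hamming-distance retrieval over packed codes (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def hamming_search(query_code, db_codes, k=5):
    """Return indices of the k database codes closest in Hamming distance.

    Codes are packed into unsigned integers, so each distance is one
    XOR plus a popcount -- operations hardware executes very fast.
    """
    dist = np.bitwise_xor(db_codes, query_code)
    # count the set bits of each XOR result (a lookup table also works)
    popcnt = np.vectorize(lambda v: bin(v).count("1"))(dist)
    return np.argsort(popcnt, kind="stable")[:k]

db = np.array([0b0000, 0b0001, 0b1111, 0b1010], dtype=np.uint8)
print(hamming_search(np.uint8(0b0011), db, k=2))  # nearest: codes 0b0001, 0b0000
```

In practice the whole database of packed codes is scanned this way, and the short list of candidates is optionally re-ranked with the ground-truth distance.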
Many different hashing approaches have been proposed in the last few years. They formulate an objective function of the hash function or of the binary codes that tries to capture some notion of neighborhood preservation. Most of these approaches have two things in common: typically h performs dimensionality reduction (L < D) and, as noted, it outputs binary codes (z ∈ {0,1}^L). The latter implies a step function or binarization applied to a real-valued function of the input x. Optimizing this is difficult. In practice, most approaches follow a two-step procedure: first they learn a real hash function ignoring the binary constraints and then the output of the resulting hash function is binarized (e.g. by thresholding or with an optimal rotation). For example, one can run a continuous dimensionality reduction algorithm (by optimizing its objective function) such as PCA and then apply a step function. This procedure can be seen as a “filter” approach (Kohavi and John, 1998) and is suboptimal: in the example, the thresholded PCA projection is not necessarily the best thresholded linear projection (i.e., the one that minimizes the objective function over all thresholded linear projections). To obtain the latter, we must optimize the objective jointly over linear mappings and thresholds, respecting the binary constraints while learning h; this is a “wrapper” approach (Kohavi and John, 1998). In other words, optimizing real codes and then projecting them onto the binary space is not the same as optimizing the codes in the binary space.
In this paper we show that this joint optimization, respecting the binary constraints during training, can actually be carried out reasonably efficiently. The idea is to use the recently proposed method of auxiliary coordinates (MAC) (Carreira-Perpiñán and Wang, 2012, 2014). This is a general strategy to transform an original problem involving a nested function into separate problems without nesting, each of which can be solved more easily. In our case, this allows us to reduce drastically the complexity due to the binary constraints. We focus on binary autoencoders, i.e., where the code layer is binary. We believe we are the first to apply MAC to this model and construct an efficient optimization algorithm. Section 3 describes the binary autoencoder model and objective function. Section 4 derives a training algorithm using MAC and explains how, with carefully implemented steps, the optimization in the binary space can be carried out efficiently, and parallelizes well. Our hypothesis is that constraining the optimization to the binary space results in better hash functions and we test this in experiments (section 6), using several performance measures: the traditional precision/recall, as well as the reconstruction error and an entropy-based measure of code utilization (which we propose in section 5). These show that linear hash functions resulting from optimizing a binary autoencoder using MAC are consistently competitive with the state-of-the-art, even when the latter uses nonlinear hash functions or more sophisticated objective functions for hashing.
The most basic hashing approaches are data-independent, such as Locality-Sensitive Hashing (LSH) (Andoni and Indyk, 2008), which is based on random projections and thresholding, and kernelized LSH (Kulis and Grauman, 2012). Generally, they are outperformed by data-dependent methods, which learn a specific hash function for a given dataset in an unsupervised or supervised way. We focus here on unsupervised, data-dependent approaches. These are typically based on defining an objective function (usually based on dimensionality reduction) either of the hash function or the binary codes, and optimizing it. However, this is usually achieved by relaxing the binary codes to a continuous space and thresholding the resulting continuous solution. For example, spectral hashing (Weiss et al., 2009)
is essentially a version of Laplacian eigenmaps where the binary constraints are relaxed and approximate eigenfunctions are computed that are then thresholded to provide binary codes. Variations of this include using AnchorGraphs (Liu et al., 2011)
to define the eigenfunctions, or obtaining the hash function directly as a binary classifier using the codes from spectral hashing as labels (Zhang et al., 2010). Other approaches optimize instead a nonlinear embedding objective that depends on a continuous, parametric hash function, which is then thresholded to define a binary hash function (Torralba et al., 2008; Salakhutdinov and Hinton, 2009); or an objective that depends on a thresholded hash function, but where the threshold is relaxed during the optimization (Norouzi and Fleet, 2011). Some recent work has tried to respect the binary nature of the codes or thresholds by using alternating optimization directly on an objective function, over one entry or one row of the weight matrix in the hash function (Kulis and Darrell, 2009; Neyshabur et al., 2013), or over a subset of the binary codes (Lin et al., 2013, 2014). Since the objective function involves a large number of terms and all the binary codes or weights are coupled, the optimization is very slow. Also, Lin et al. (2013, 2014) learn the hash function after the codes have been fixed, which is suboptimal.
The closest model to our binary autoencoder is Iterative Quantization (ITQ) (Gong et al., 2013), a fast and competitive hashing method. ITQ first obtains continuous low-dimensional codes by applying PCA to the data and then seeks a rotation that makes the codes as close as possible to binary. The latter is based on the optimal discretization algorithm of Yu and Shi (2003)
, which finds a rotation of the continuous eigenvectors of a graph Laplacian that makes them as close as possible to a discrete solution, as a postprocessing for spectral clustering. The ITQ objective function is

min_{B,R} ||B − VR||_F^2   s.t.   B ∈ {−1,+1}^{N×L}, R^T R = I,

where the rows of V ∈ R^{N×L} are the continuous codes obtained by PCA. This is an NP-complete problem, and a local minimum is found using alternating optimization over B, with solution B = sgn(VR) elementwise, and over R, which is a Procrustes alignment problem with a closed-form solution based on an SVD. The final hash function is h(x) = sgn(R^T W^T x), which has the form of a thresholded linear projection. Hence, ITQ is a postprocessing of the PCA codes, and it can be seen as a suboptimal approach to optimizing a binary autoencoder, where the binary constraints are relaxed during the optimization (resulting in PCA), and then one “projects” the continuous codes back to the binary space. Semantic hashing (Salakhutdinov and Hinton, 2009)
also uses an autoencoder objective with a deep encoder (consisting of stacked RBMs), but again its optimization uses heuristics that are not guaranteed to converge to a local optimum: either training it as a continuous problem with backpropagation and then applying a threshold to the encoder (Salakhutdinov and Hinton, 2009), or rounding the encoder output to 0 or 1 during the backpropagation forward pass but ignoring the rounding during the backward pass (Krizhevsky and Hinton, 2011).
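For concreteness, the ITQ alternation just described (sign step over B, Procrustes step over the rotation R) can be sketched as follows. This is an illustrative numpy reimplementation, not the authors' code:

```python
import numpy as np

def itq(V, n_iter=50, seed=0):
    """Sketch of the ITQ alternation: V is the (N, L) matrix of
    continuous PCA codes. Alternates an elementwise sign step over B
    with a Procrustes step over the rotation R (closed form via SVD).
    Illustrative reimplementation; see Gong et al. (2013) for details."""
    rng = np.random.default_rng(seed)
    # start from a random rotation, as in the original method
    R, _ = np.linalg.qr(rng.standard_normal((V.shape[1], V.shape[1])))
    for _ in range(n_iter):
        B = np.sign(V @ R)                  # B step: closest binary matrix
        U, _, Vt = np.linalg.svd(V.T @ B)   # R step: Procrustes alignment
        R = U @ Vt
    return np.sign(V @ R), R
```

The R step maximizes tr(R^T V^T B) over orthogonal R, whose closed-form solution is R = U Vt from the SVD of V^T B.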
We consider a well-known model for continuous dimensionality reduction, the (continuous) autoencoder, defined in a broad sense as the composition of an encoder h: R^D → R^L, which maps a real vector x onto a real code vector z (with L < D), and a decoder f: R^L → R^D, which maps z back to R^D in an effort to reconstruct x. Although our ideas apply more generally to other encoders, decoders and objective functions, in this paper we mostly focus on the least-squares error with a linear encoder and decoder. As is well known, the optimal solution is then PCA.
For hashing, the encoder maps continuous inputs onto binary code vectors with L bits, h: R^D → {0,1}^L. Let us write h(x) = σ(Wx) (where W includes a bias by having an extra dimension x_{D+1} = 1 for each x), where W ∈ R^{L×(D+1)} and σ is a step function applied elementwise, i.e., σ(t) = 1 if t ≥ 0 and σ(t) = 0 otherwise (we can fix the threshold at 0 because the bias acts as a threshold for each bit). Our desired hash function will be h, and it should minimize the following problem, given a dataset of high-dimensional patterns x_1, …, x_N:

E_BA(h, f) = Σ_{n=1}^N ||x_n − f(h(x_n))||^2,   (2)
which is the usual least-squares error but where the code layer is binary. Optimizing this nonsmooth function is difficult and NP-complete; where the gradients with respect to W do exist, they are zero nearly everywhere. We call this model a binary autoencoder (BA).
We will also consider a related model (see later):

E_BFA(f, Z) = Σ_{n=1}^N ||x_n − f(z_n)||^2   s.t.   z_n ∈ {0,1}^L, n = 1, …, N,

where f is linear and we optimize over the decoder f and the binary codes z_1, …, z_N of each input pattern. Without the binary constraint, i.e., with z_n ∈ R^L, this model dates back to the 50s and is sometimes called least-squares factor analysis (Whittle, 1952), and its solution is PCA. With the binary constraints, the problem is NP-complete, because it includes as a particular case solving a linear system over {0,1}^L, which is an integer LP feasibility problem. We call this model (least-squares) binary factor analysis (BFA). We believe this model has not been studied before, at least in hashing. A hash function h can be obtained from BFA by fitting a binary classifier of the inputs to each of the L code bits. It is a filter approach, while the BA is the optimal (wrapper) approach, since it optimizes jointly over h and f.
We use the recently proposed method of auxiliary coordinates (MAC) (Carreira-Perpiñán and Wang, 2012, 2014). The idea is to break nested functional relationships judiciously by introducing new variables as equality constraints. These are then solved by optimizing a penalized function using alternating optimization over the original parameters and the coordinates, which results in a coordination-minimization (CM) algorithm. Recall eq. (2); this is our nested problem, where the model is f(h(x)). We introduce as auxiliary coordinates the outputs of h, i.e., the codes z_n for each of the N input patterns, and obtain the following equality-constrained problem:

min_{h,f,Z} Σ_{n=1}^N ||x_n − f(z_n)||^2   s.t.   z_n = h(x_n), z_n ∈ {0,1}^L, n = 1, …, N.
Note the codes are binary. We now apply the quadratic-penalty method (it is also possible to apply the augmented Lagrangian method instead; Nocedal and Wright, 2006) and minimize the following objective function while progressively increasing the penalty parameter μ, so the constraints are eventually satisfied:

E_Q(h, f, Z; μ) = Σ_{n=1}^N ( ||x_n − f(z_n)||^2 + μ ||z_n − h(x_n)||^2 )   s.t.   z_n ∈ {0,1}^L, n = 1, …, N.   (5)
Now we apply alternating optimization of E_Q over Z and (h, f). This results in the following two steps:
Over Z for fixed (h, f), the problem separates for each of the N codes. The optimal code vector for pattern x_n tries to be close to the prediction h(x_n) while reconstructing x_n well.
Over (h, f) for fixed Z, we obtain L + 1 independent problems: one for each of the L single-bit hash functions (which try to predict Z optimally from X), and one for f (which tries to reconstruct X optimally from Z).
We can now see the advantage of the auxiliary coordinates: the individual steps are (reasonably) easy to solve (although some work is still needed, particularly for the Z step), and besides they exhibit significant parallelism. We describe the steps in detail below. The resulting algorithm alternates steps over the encoder (L classifications) and decoder (one regression) and over the codes (N binary proximal operators; Rockafellar, 1976; Combettes and Pesquet, 2011). During the iterations, we allow the encoder and decoder to be mismatched, since the encoder output does not equal the decoder input, but they are coordinated by the penalty term, and as μ increases the mismatch is reduced. The overall MAC algorithm to optimize a BA is in fig. 1.
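The overall alternation can be sketched end to end on a toy problem. This minimal numpy sketch simplifies the steps described later: the encoder is fit by thresholded least squares rather than per-bit SVMs, biases are omitted, and the Z step enumerates all 2^L codes (feasible only for small L); all names are illustrative:

```python
import itertools
import numpy as np

def mac_ba(X, L=4, mu0=1e-4, n_it=12, seed=0):
    """Toy sketch of the MAC alternation for a binary autoencoder.

    X: (D, N) data; L: number of bits, kept small so the Z step can
    enumerate all 2^L codes. The (h, f) step uses plain least squares
    (a simplification of the SVM fits in the paper)."""
    rng = np.random.default_rng(seed)
    Z = rng.integers(0, 2, size=(L, X.shape[1])).astype(float)  # initial codes
    codes = np.array(list(itertools.product([0.0, 1.0], repeat=L))).T  # (L, 2^L)
    mu = mu0
    for _ in range(n_it):
        # (h, f) step: decoder A from (Z, X); encoder W from (X, Z)
        A = np.linalg.lstsq(Z.T, X.T, rcond=None)[0].T          # f(z) = A z
        W = np.linalg.lstsq(X.T, (2 * Z - 1).T, rcond=None)[0].T
        H = (W @ X > 0).astype(float)                           # h(x), elementwise step
        # Z step: separates per point; try every code, keep the best
        rec = ((X[:, :, None] - (A @ codes)[:, None, :]) ** 2).sum(0)  # (N, 2^L)
        pen = ((H[:, :, None] - codes[:, None, :]) ** 2).sum(0)        # Hamming penalty
        Z = codes[:, np.argmin(rec + mu * pen, axis=1)]
        mu *= 2                                                 # penalty schedule
    return W, A, Z
```

Each iteration decreases the penalized objective for the current μ; the two inner loops over points and bits are embarrassingly parallel.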
Although MAC can be shown to produce convergent algorithms as μ → ∞ with a differentiable objective function, we cannot apply the theorem of Carreira-Perpiñán and Wang (2012) because of the binary nature of the problem. Instead, we show that our algorithm converges to a local minimum for a finite μ, where “local minimum” is understood as in k-means: a point where Z is a global minimizer given (h, f) and vice versa. The following theorem is valid for any choice of h and f, not just linear.
Theorem. Assume the steps over Z and (h, f) are solved exactly. Then the MAC algorithm for the binary autoencoder stops at a finite μ.
Proof. μ appears only in the Z step, and if Z does not change there, h and f will not change either, since the Z and (h, f) steps are exact. The step over z_n minimizes ||x_n − f(z_n)||^2 + μ ||z_n − h(x_n)||^2, and from theorem 4.3 we have that z_n = h(x_n) is a global minimizer if μ > ||x_n − f(h(x_n))||^2. The statement follows from the fact that ||x_n − f(h(x_n))||^2 is bounded over all n, Z and (h, f). Let us prove this fact. Clearly it holds for fixed (h, f) because the z_n take values on a finite set, namely {0,1}^L. As for (h, f), even if the set of such functions is infinite, the number of different functions that are possible is finite, because (h, f) results from an exact fit to (X, Z), where X is fixed and the set of possible Z is finite (since each z_n is binary). ∎
The minimizers of E_Q trace a path as a function of μ in (h, f, Z) space. BA and BFA can be seen as the limiting cases of E_Q when μ → ∞ and μ → 0+, respectively (for BFA, f and Z can be optimized independently of h, but h must optimally fit the resulting Z). Figure 2 shows graphically the connection between the BA and BFA objective functions, as the two ends of the continuous path in (h, f, Z) space on which the quadratic-penalty function E_Q is defined.
In practice, to learn the BFA model we set μ to a small value and keep it constant while running the BA algorithm. As for BA itself, we increase μ (times a constant factor, e.g. 2) and iterate the Z and (h, f) steps for each value of μ. Usually the algorithm stops in 10 to 15 iterations, when no further changes to the parameters occur.
With a linear decoder f(z) = Az, this is a simple linear regression
with data (Z, X). A bias parameter is necessary, for example to be able to use {0,1} instead of {−1,+1} equivalently as bit values. The solution of this regression is A = XZ^T (ZZ^T)^{−1} (ignoring the bias for simplicity) and can be computed in O(NL(D + L)). Note the constant factor in the O-notation is small because Z is binary; e.g. ZZ^T involves only sums, not multiplications.
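A minimal numpy sketch of this f step, folding the bias into the weight matrix via an appended row of ones (illustrative code, not the paper's implementation):

```python
import numpy as np

def fit_decoder(Z, X):
    """f step: least-squares linear decoder x ~ A z + c.

    Z: (L, N) binary codes, X: (D, N) data. A row of ones appended to Z
    folds the bias c into the fit; since Z is binary, Z @ Z.T involves
    only sums, no multiplications.
    """
    Z1 = np.vstack([Z, np.ones((1, Z.shape[1]))])    # add bias row
    # normal equations: A1 = X Z1^T (Z1 Z1^T)^{-1}
    A1 = np.linalg.solve(Z1 @ Z1.T, Z1 @ X.T).T
    return A1[:, :-1], A1[:, -1]                     # weights A, bias c
```

Solving the (L+1)×(L+1) normal equations rather than forming an explicit inverse is both cheaper and numerically safer.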
This has the following form:

min_h Σ_{n=1}^N ||z_n − h(x_n)||^2.

Since z_n and h(x_n) are binary, ||z_n − h(x_n)||^2 is the Hamming distance and the objective function counts misclassified bits, so it separates over the L bits. So it is a classification problem for each bit, using as labels the auxiliary coordinates, where each single-bit hash function h_l(x) = σ(w_l^T x) is a linear classifier (a perceptron). However, rather than minimizing this, we will solve an easier, closely related problem: fit a linear SVM to each bit's labels, where we use a high penalty for misclassified patterns but optimize the margin plus the slack. Besides being easier (by reusing well-developed codes for SVMs), this surrogate loss has the advantage of making the solution unique (no local optima) and generalizing better to test data (maximum margin). Also, although we used a quadratic penalty, the spirit of penalty methods is to penalize constraint violations (z_n ≠ h(x_n)) increasingly. Since in the limit μ → ∞ the constraints are satisfied exactly, the classification error is zero, hence the linear SVM will find an optimum of the nested problem anyway. We use LIBLINEAR (Fan et al., 2008) with warm start (i.e., the SVM optimization is initialized from the previous iteration’s SVM). Note the L SVMs and the decoder function f can be trained in parallel.
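One bit's fit can be sketched with a plain subgradient-descent hinge-loss solver standing in for LIBLINEAR (an assumption for illustration; the paper uses LIBLINEAR with warm start, and all names here are hypothetical):

```python
import numpy as np

def fit_bit_svm(X, y, C=10.0, lr=0.1, epochs=300):
    """Fit one bit's linear classifier by subgradient descent on the
    SVM objective ||w||^2 / 2 + C * sum of hinge losses. A simple
    stand-in for the LIBLINEAR solver used in the paper.
    X: (N, D) inputs, y: binary labels in {0, 1}."""
    t = 2.0 * y - 1.0                      # hinge loss uses +/-1 labels
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = t * (X @ w + b) < 1         # points violating the margin
        gw = w - C * (t[viol][:, None] * X[viol]).sum(axis=0)
        gb = -C * t[viol].sum()
        w -= lr * gw / len(X)
        b -= lr * gb / len(X)
    return w, b

def fit_encoder(X, Z):
    """h step: L independent per-bit fits (these could run in parallel)."""
    return [fit_bit_svm(X, Z[:, l]) for l in range(Z.shape[1])]
```

The L fits share no state, which is what makes the h step embarrassingly parallel.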
From eq. (5), this is a binary optimization on NL variables, but it separates into N independent optimizations each on only L variables, with the form of a binary proximal operator (Moreau, 1962; Rockafellar, 1976; Combettes and Pesquet, 2011) (where we omit the index n):

min_z e(z) = ||x − f(z)||^2 + μ ||z − h(x)||^2   s.t.   z ∈ {0,1}^L.
Thus, although the problem over each z_n is binary and NP-complete, a good or even exact solution may be obtained, because practical values of L are small (typically 8 to 32 bits). Further, because of the intensive computation and the large number N of independent problems, this step can take much advantage of parallel processing.
We have put significant effort into making this step efficient while yielding good, if not exact, solutions. Before proceeding, let us show how to reduce the problem, which as stated uses a matrix of D × L, to an equivalent problem using a matrix of L × L. (This result can also be derived by expanding the norms into an L × L matrix and computing its Cholesky decomposition. However, that product squares the singular values of the matrix and loses precision because of roundoff error; Golub and van Loan, 1996, p. 237ff; Nocedal and Wright, 2006, p. 251.)
Lemma. Let A ∈ R^{D×L} with D ≥ L and QR factorisation A = QR, where Q ∈ R^{D×L} satisfies Q^T Q = I, R ∈ R^{L×L} is upper triangular, and y = Q^T x. The following two problems have the same minima over z ∈ {0,1}^L:

min_z ||x − Az||^2 + μ ||z − h(x)||^2   and   min_z ||y − Rz||^2 + μ ||z − h(x)||^2.

Proof. Let (Q Q⊥) ∈ R^{D×D} be orthogonal, where the columns of Q⊥ ∈ R^{D×(D−L)} are an orthonormal basis of the nullspace of Q^T. Then, since orthogonal matrices preserve Euclidean distances, we have ||x − Az||^2 = ||(Q Q⊥)^T (x − QRz)||^2 = ||y − Rz||^2 + ||Q⊥^T x||^2, where the second term does not depend on z. ∎
This achieves a speedup of about 2D/L (where the factor 2 comes from the fact that the new matrix is triangular); e.g. this is 40× if using L = 16 bits with D = 320 GIST features in our experiments. Henceforth, we redefine the Z step as min_z ||y − Rz||^2 + μ ||z − h(x)||^2 s.t. z ∈ {0,1}^L.
For small L, this can be solved exactly by enumeration, at a worst-case runtime cost of O(2^L) objective evaluations per point, but with small constant factors in practice (see accelerations below). L ≤ 16 is perfectly practical in a workstation without parallel processing for the datasets in our experiments.
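Assuming the reduced form from the QR step (the triangular factor R and y = Q^T x), the exact enumeration for one point might look like this sketch (without the pruning accelerations described below; names are illustrative):

```python
import itertools
import numpy as np

def z_step_enum(R, y, h_x, mu):
    """Exact Z step for one point by enumeration over all 2^L codes.

    Minimizes ||y - R z||^2 + mu * ||z - h_x||^2 over z in {0,1}^L,
    where R is the L x L triangular factor from the QR reduction and
    h_x = h(x) is the encoder's prediction. Feasible only for small L."""
    L = len(h_x)
    best, best_e = None, np.inf
    for bits in itertools.product([0.0, 1.0], repeat=L):
        z = np.array(bits)
        e = np.sum((y - R @ z) ** 2) + mu * np.sum((z - h_x) ** 2)
        if e < best_e:
            best, best_e = z, e
    return best
```

The N per-point problems are independent, so in practice this loop runs in parallel over the dataset.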
For larger L, we use alternating optimization over groups of g bits (where the optimization over a g-bit group is done by enumeration and uses the same accelerations). This converges to a local minimum of the Z step, although we find in our experiments that it finds near-global optima if using a good initialization. Intuitively, it makes sense to warm-start this, i.e., to initialize z_n to the code found in the previous iteration’s Z step, since this should be close to the new optimum as we converge. However, empirically we find that the codes change a lot in the first few iterations, and that the following initialization works better (in leading to a lower objective value) in early iterations: we solve the relaxed problem on z ∈ [0,1]^L rather than {0,1}^L. This is a strongly convex bound-constrained quadratic program (QP) in L variables for μ > 0, and its unique minimizer can be found efficiently.
We can further speed up the solution by noting that we have N QPs with some common, special structure: the objective is the sum of a quadratic term having the same matrix for all QPs, and a term that is separable in z. We have developed an ADMM algorithm (Carreira-Perpiñán, 2014) that is very simple, parallelizes or vectorizes very well, and reuses matrix factorizations over all N QPs. It is considerably faster than Matlab’s quadprog. We warm-start it from the continuous solution of the QP in the previous Z step.
In order to binarize the continuous minimizer z ∈ [0,1]^L we could simply round its elements, but instead we apply a greedy procedure that is efficient and better (though still suboptimal). We binarize from bit 1 to bit L by evaluating the objective function for bit l ∈ {0,1} with all remaining elements fixed (elements 1 to l − 1 are already binary and l + 1 to L are still continuous) and picking the best. Essentially, this is one pass of alternating optimization but having continuous values for some of the bits.
Finally, we pick the best of the binarized relaxed solution or the warm-start value and run alternating optimization. This ensures that the quadratic-penalty function (5) decreases monotonically at each iteration.
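The one-pass greedy binarization of the relaxed solution can be sketched as follows (illustrative code using the reduced form with triangular R and y = Q^T x):

```python
import numpy as np

def greedy_binarize(z_relaxed, R, y, h_x, mu):
    """Binarize the relaxed QP solution bit by bit, in one pass.

    Each bit l is set to whichever of {0, 1} gives the lower objective
    ||y - R z||^2 + mu * ||z - h_x||^2, with bits before l already
    binary and bits after l still continuous."""
    z = z_relaxed.astype(float).copy()
    for l in range(len(z)):
        best, best_e = 0.0, np.inf
        for v in (0.0, 1.0):
            z[l] = v
            e = np.sum((y - R @ z) ** 2) + mu * np.sum((z - h_x) ** 2)
            if e < best_e:
                best, best_e = v, e
        z[l] = best
    return z
```

This costs only 2L objective evaluations yet typically beats plain elementwise rounding, because later bits see the already-binarized earlier bits.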
Naively, the enumeration involves evaluating e(z) for 2^L (or 2^g) vectors, where evaluating e for one z costs O(L^2) operations on average. This enumeration can be sped up or pruned, while still finding a global minimum, by using upper bounds on e(z), incremental computation of e(z), and necessary and sufficient conditions for the solution. Essentially, we need not evaluate every code vector, or every bit of every code vector; we know the solution will be “near” h(x); and we can recognize the solution when we find it.
Call z* a global minimizer of e(z). An initial, good upper bound on e(z*) is e(h(x)). In fact, we have the following sufficient condition for h(x) to be a global minimizer. (We give it generally for any decoder f.)
Theorem. Let z̄ = h(x). Then: (1) A global minimizer of e(z) is at a Hamming distance from z̄ of ||x − f(z̄)||^2 / μ or less. (2) If ||x − f(z̄)||^2 < μ, then z̄ is a global minimizer.
Proof. (1) follows because e(z*) ≤ e(z̄) implies μ ||z* − z̄||^2 ≤ e(z̄) = ||x − f(z̄)||^2, and ||z* − z̄||^2 is the Hamming distance between z* and z̄. (2) follows because the Hamming distance is an integer. ∎
As μ increases and h improves, this bound becomes more effective and more of the N patterns are pruned; upon convergence, the Z step reduces to the bound check for each point. If we do have to search for a given x, we keep a running bound (the current best minimum found so far), and we scan codes in increasing Hamming distance to z̄ up to a distance given by that bound divided by μ. Thus, we try first the codes that are more likely to be optimal, and keep refining the bound as we find better codes.
Second, since e(z) accumulates over the L dimensions, we evaluate it incrementally (dimension by dimension) and stop as soon as we exceed the running bound.
Finally, there exist global optimality necessary and sufficient conditions for binary quadratic problems that are easy to evaluate (Beck and Teboulle, 2000; Jeyakumar et al., 2007) (see appendix A). This allows us to recognize the solution as soon as we reach it and stop the search (rather than do a linear search of all 2^L values, keeping track of the minimum). These conditions can also determine whether the continuous solution to the relaxed QP is a global minimizer of the binary QP.
The only user parameters in our method are the initialization for the binary codes Z and the schedule for the penalty parameter μ (the sequence of values μ_1 < μ_2 < ⋯), since we use a penalty or augmented Lagrangian method. In general with these methods, setting the schedule requires some tuning in practice. Fortunately, this is simplified in our case for two reasons. 1) We need not drive μ → ∞, because termination occurs at a finite μ and can be easily detected: whenever Z does not change at the end of the Z step, no further changes to the parameters can occur. This gives a practical stopping criterion. 2) In order to generalize well to unseen data, we stop iterating not when we (sufficiently) optimize E_Q, but when the precision in a validation set decreases. This is a form of early stopping that guarantees that we improve (or leave unchanged) the initial Z, and besides is faster. The initialization for Z and further details about the schedule for μ appear in section 6.
Here we propose an evaluation measure of binary hash functions that has not been used before as far as we know. Any binary hash function h maps a population of high-dimensional real vectors onto a population of L-bit vectors (where L is fixed). Intuitively, a good hash function should make best use of the available codes and use each bit equally (since no bit is preferable to any other bit). For example, if we have L = 32 bits and fewer than 2^32 distinct real vectors, a good hash function would ideally assign a different 32-bit code to each vector, in order to avoid collisions. Given an L-bit hash function h and a dataset of N real vectors x_1, …, x_N, we then obtain the L-bit binary codes z_1 = h(x_1), …, z_N = h(x_N). We can measure the code utilization of the hash function for the dataset by the entropy of the code distribution P, defined as follows. Let N_k be the number of vectors that map to binary code k, for k = 0, …, 2^L − 1. Then the (sample) code distribution P
is a discrete probability distribution defined over the L-bit integers 0, …, 2^L − 1 and has probability P_k = N_k / N for code k, since Σ_k N_k = N, i.e., the code probabilities are the normalized counts computed over the N data points. This works whether N is smaller or larger than 2^L (if N < 2^L there will be unused codes). If the dataset is a sample of a distribution of high-dimensional vectors, then P
is an estimate of the code usage induced by the hash function h for that distribution, based on a sample of size N. The entropy of P is

H(P) = − Σ_{k=0}^{2^L − 1} P_k log2 P_k,
which is measured in bits. The entropy is a real number and satisfies 0 ≤ H(P) ≤ min(L, log2 N). It is 0 when all codes are coincident (hence only one of the 2^L available codes is used). It is L when 2^L ≤ N (the number of available codes is no more than the number of data points) and P_k = 1/2^L for all k (uniform distribution). It is log2 N when N < 2^L and all codes are distinct. Hence, the entropy is larger the more of the available codes are used, and the more uniform their use is. Since the entropy is measured in bits and cannot be more than L, H(P) can be said to measure the effective number of bits of the code distribution induced by the hash function h on the dataset.
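The entropy computation is straightforward; a short numpy sketch (the function name is illustrative):

```python
import numpy as np

def effective_bits(codes, L):
    """Entropy, in bits, of the empirical code distribution P.

    codes: integer codes in [0, 2^L) assigned by the hash function.
    Returns H(P) with 0 <= H <= min(L, log2(N)): the 'effective
    number of bits' used by the hash function on this dataset."""
    counts = np.bincount(codes, minlength=2 ** L)
    p = counts[counts > 0] / len(codes)          # nonzero code probabilities
    return float(-(p * np.log2(p)).sum() + 0.0)  # +0.0 avoids returning -0.0

# N = 8 distinct codes -> H = 3 bits; all codes coincident -> H = 0
print(effective_bits(np.arange(8), L=3))            # 3.0
print(effective_bits(np.zeros(8, dtype=int), L=3))  # 0.0
```

Dropping zero-count codes before taking logs implements the usual convention 0 log 0 = 0.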
A good hash function will have a good code utilization, making use of the available codes to avoid collisions and preserve neighbors. However, it is crucial to realize that an optimal code utilization does not by itself result in an optimal hash function, and it is not a fully reliable proxy for precision/recall. Indeed, code utilization is not directly related to the distances between data vectors, and it is easy to construct hash functions that produce an optimal code utilization but are not necessarily very good at preserving neighbors. For example, we can pick the first hash function as a hyperplane that splits the space into two half-spaces each with half the data points (this is a high-dimensional median). Then, we do this within each half to set the second hash function, etc. This generates a binary decision tree with oblique cuts, which is impractical because it has 2^L − 1 hyperplanes, one per internal node. If the real vectors follow a Gaussian distribution (or any other axis-symmetric distribution), then using as hash functions any L thresholded principal components will give maximum entropy (as long as 2^L ≤ N). This is because each of the L hash functions is a hyperplane that splits the space into two half-spaces containing half of the vectors, and the hyperplanes are orthogonal. This gives the thresholded PCA (tPCA) method mentioned earlier, which is generally not competitive with other methods, as seen in our experiments. More generally, we may expect tPCA and random projections through the mean to achieve high code utilization, because for most high-dimensional datasets (of arbitrary distribution), most low-dimensional projections are approximately Gaussian (Diaconis and Freedman, 1984).
With this caveat in mind, code utilization measured in effective number of bits is still a useful evaluation measure for hash functions. It also has an important advantage: it does not depend on any user parameters. In particular, it does not depend on the ground-truth set size (the K nearest neighbors in data space) or the retrieved set size (given by the k nearest neighbors in Hamming distance, or the vectors with codes within Hamming distance r). This allows us to compare all binary hashing methods with a single number (for a given number of bits L). We report values of H(P) in the experiments section and show that it indeed correlates well with precision in terms of the ranking of methods, particularly for methods that use the same model and objective function (such as the binary autoencoder reconstruction error with ITQ and MAC).
We used three datasets in our experiments, commonly used as benchmarks for image retrieval. (1) CIFAR (Krizhevsky, 2009) contains 60 000 color images in 10 classes. We ignore the labels in this paper and use 50 000 images as training set and 10 000 images as test set. We extract D = 320 GIST features (Oliva and Torralba, 2001) from each image. (2) NUS-WIDE (Chua et al., 2009) contains 269 648 high-resolution color images, most of which we use for training and the rest for test. We extract D = 128 wavelet features (Oliva and Torralba, 2001) from each image. In some experiments we also use the NUS-WIDE-LITE subset of this dataset, containing 27 807 images for training and 27 808 images for test. (3) SIFT-1M (Jégou et al., 2011) contains 1 000 000 training vectors and 10 000 test vectors, each represented by D = 128 SIFT features.
We report precision and recall (%) in the test set using as true neighbors the K nearest images in Euclidean distance in the original space, and as retrieved neighbors in the binary space we either use the k nearest images in Hamming distance, or the images within a Hamming distance r (if no images satisfy the latter, we report zero precision). We also compare algorithms using our entropy-based measure of code utilization described in section 5.
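For one query, these measures reduce to set intersections; a minimal sketch (names illustrative):

```python
def precision_recall(true_nn, retrieved):
    """Precision and recall (%) for one query, given index sets.

    true_nn: ground-truth neighbors (the K nearest in the original
    Euclidean space); retrieved: indices whose codes lie within Hamming
    distance r of the query code (or its k Hamming-nearest codes).
    Precision is reported as 0 when nothing is retrieved.
    """
    inter = len(set(true_nn) & set(retrieved))
    precision = 100.0 * inter / len(retrieved) if retrieved else 0.0
    recall = 100.0 * inter / len(true_nn)
    return precision, recall

print(precision_recall([0, 1, 2, 3], [1, 2, 9]))  # ~ (66.7, 50.0)
```

The reported curves average these per-query values over the whole test set for each setting of r or k.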
Our experiments evaluate the effectiveness of our algorithm to minimize the BA objective and whether this translates into better hash functions (i.e., better image retrieval); its runtime and parallel speedup; and its precision and recall and code utilization compared to representative state-of-the-art algorithms.
We focus purely on the BA objective function (reconstruction error) and study the gain obtained by the MAC optimization, which respects the binary constraints, over the suboptimal, “filter” approach of relaxing the constraints (i.e., PCA) and then binarizing the result by thresholding at 0 (tPCA) or by an optimal rotation (ITQ). To compute the reconstruction error for tPCA and ITQ we find the optimal mapping given their binary codes. We use the NUS-WIDE-LITE subset of the NUS-WIDE dataset. We initialize BA from AGH and BFA from tPCA and use alternating optimization in the Z steps. We search for the K nearest neighbors as ground truth and report results over a range of numbers of bits L in fig. 3. We can see that BA dominates all other methods in reconstruction error, as expected, and also in precision, as one might expect. tPCA is consistently the worst method by a significant margin, while ITQ and BFA are intermediate. Hence, the more we respect the binary constraints during the optimization, the better the hash function. Further experiments below consistently show that the BA precision significantly increases over the (AGH) initialization and is leading or competitive over other methods.
We study the MAC optimization when doing an inexact Z step by using alternating optimization over groups of g bits. Specifically, we study the effect on the number of iterations and runtime of the group size g and of the initialization (warm-start vs relaxed QP). Fig. 4 shows the results on the CIFAR dataset using L = 16 bits (so using g = 16 gives an exact optimization), without using a validation-based stopping criterion (so we do optimize the training objective).
Surprisingly, the warm-start initialization leads to worse BA objective function values than the binarized relaxed one. Fig. 4 (left) shows the dashed lines (warm-start, for different g) are all above the solid lines (relaxed, for different g). The reason is that, early during the optimization, the codes undergo drastic changes from one iteration to the next, so the warm-start initialization is farther from a good optimum than the relaxed one. Late in the optimization, when the codes change slowly, the warm-start does perform well. The optima resulting from the relaxed initialization are almost the same as those obtained with the exact binary optimization.
Also surprisingly, different group sizes eventually converge to almost the same result as the exact binary optimization if using the relaxed initialization. (If using warm-start, the larger the group size, the better the result, as one would expect.) Likewise, in fig. 3, if using alternating optimization in the code step rather than enumeration, the curves for BA and BFA barely vary. But, of course, the runtime per iteration grows exponentially with the group size (middle panel).
Hence, it appears that using faster, inexact code steps does not impair the model learnt, and we settle on a small group size with relaxed initialization as the default for all our remaining experiments (unless the number of bits is small enough that we can simply use enumeration).
Fig. 5 shows the BA training-time speedup achieved with parallel processing on CIFAR. We use the Matlab Parallel Processing Toolbox with up to 12 processors and simply replace “for” with “parfor” loops, so that each iteration (over points in the code step, over bits in the hash-function step) runs on a different processor. We observe nearly perfect scaling for this particular problem. Note that the larger the number of bits, the greater the opportunity for parallelization in the hash-function step. As a rough indication of runtimes, training BA on the CIFAR and NUS-WIDE images with alternating optimization in the code step takes about 20 and 50 minutes, respectively (on a 4-core laptop).
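The reason the scaling is nearly perfect is that the code step decomposes into one independent subproblem per point (and the function step into one per bit). A minimal Python sketch of that structure, assuming a hypothetical `solve_one` that maps one point to its code (the paper itself uses Matlab's `parfor`; the pool type and worker count here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def z_step_parallel(points, solve_one, n_workers=4):
    """Run the code step over all points in parallel: each point's code is
    found independently of the others, so the loop over points is
    embarrassingly parallel. For CPU-bound work in CPython, a process pool
    would be the closer analogue of Matlab's parfor."""
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        return list(ex.map(solve_one, points))
```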
Since the objective is nonconvex, our result does depend on the initial codes (in the first iteration of the MAC algorithm), but we are guaranteed to improve or leave unchanged the precision (on the validation set) of the codes produced by any algorithm. We have observed that initializing from AGH (Liu et al., 2011) tends to produce the best results overall, so we use this in all the experiments. Using ITQ (Gong et al., 2013) also produces good results (occasionally better but generally somewhat worse than AGH), and is a simpler and faster option if so desired. We initialize BFA from tPCA, since this seems to work best.
In order to use a fixed schedule, we make the data zero-mean and rescale it so that the largest feature range is 1. This does not alter the Euclidean distances and normalizes the scale. We start with a small value of the penalty parameter and double it after each iteration (one code step and one function step). As noted in section 4, the algorithm will skip penalty values that do not improve the precision on the validation set, and will stop at a finite value (past which no further changes occur).
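This schedule can be sketched in a few lines of Python (a minimal sketch under our assumptions; the function names, the initial penalty value, and the iteration cap are illustrative, not the paper's settings):

```python
def mac_schedule(step_fn, precision_fn, mu0=1e-4, max_iters=20):
    """Doubling penalty schedule for MAC. `step_fn(state, mu)` performs one
    iteration (one code step and one function step) at penalty mu and
    returns the new state; `precision_fn` evaluates validation precision.
    Iterations that do not improve the validation precision are skipped,
    i.e., their result is discarded, as described in the text."""
    mu, state = mu0, None
    best_prec = float("-inf")
    for _ in range(max_iters):
        candidate = step_fn(state, mu)
        prec = precision_fn(candidate)
        if prec >= best_prec:      # keep only non-worsening iterations
            state, best_prec = candidate, prec
        mu *= 2.0                  # double the penalty after each iteration
    return state, best_prec
```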
It is of course possible to tweak all these settings (the penalty schedule and its initial value) and obtain better results, but these defaults seem robust.
We compare BA and BFA with the following algorithms: thresholded PCA (tPCA), Iterative Quantization (ITQ) (Gong et al., 2013), Spectral Hashing (SH) (Weiss et al., 2009), Kernelized Locality-Sensitive Hashing (KLSH) (Kulis and Grauman, 2012), Anchor Graph Hashing (AGH) (Liu et al., 2011), and Spherical Hashing (SPH) (Heo et al., 2012). Note that several of these learn nonlinear hash functions and use more sophisticated error functions (which better approximate the nearest-neighbor ideal), while our BA uses a linear hash function and simply minimizes the reconstruction error. All experiments use the output of AGH and tPCA to initialize BA and BFA, respectively.
It is known that the retrieval performance of a given algorithm depends strongly on the size of the neighbor set used, so we report experiments with both small and large ground-truth sets. For NUS-WIDE, we considered small and large sets of true neighbors of the query point as ground truth, and as the set of retrieved neighbors we retrieve either a fixed number of nearest neighbors or all neighbors within a given Hamming distance. Fig. 6 shows the results. For ANNSIFT-1M, we likewise varied the ground-truth and retrieved sets; fig. 7 shows the results. All curves are averages over the test set. We also show precision and recall for the CIFAR dataset (fig. 8), the top retrieved images for sample query images in the test set (fig. 9), and results for the NUS-WIDE-LITE dataset (fig. 10), for different ground-truth sizes.
Although the relative performance of the different methods varies depending on the reported set size, some trends are clear. Generally (though not always) BA beats all other methods, sometimes by a significant margin. ITQ and SPH become close (sometimes comparable) to BA on the CIFAR and NUS-WIDE datasets, respectively. BFA is also quite competitive, but consistently worse than BA. The only situation where the precision of BA and BFA appears to decrease is when the number of bits is large and the Hamming distance used for retrieval is small. The reason is that many test images then have no neighbors within that Hamming distance, and we report zero precision for them. This suggests the hash function finds a way to avoid collisions as more bits become available. In practice, one would simply increase the Hamming distance to retrieve sufficient neighbors.
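The retrieval mechanics behind this effect can be sketched in a few lines, assuming codes packed into Python integers (the function names are ours, for illustration only):

```python
def hamming(a, b):
    """Hamming distance between two binary codes packed as Python ints."""
    return bin(a ^ b).count("1")

def precision_at_radius(query_code, db_codes, true_ids, r):
    """Retrieve database items within Hamming distance r of the query and
    return the fraction that are ground-truth neighbors. As in the text,
    precision is reported as 0 when nothing falls within the radius, which
    explains the apparent drop for many bits and a small radius."""
    retrieved = [i for i, c in enumerate(db_codes)
                 if hamming(query_code, c) <= r]
    if not retrieved:
        return 0.0
    hits = sum(1 for i in retrieved if i in true_ids)
    return hits / len(retrieved)
```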
Fig. 7 shows the results of BA using two different initializations, from the codes of AGH and of ITQ, respectively. Although the initialization does affect the local optimum that the MAC algorithm finds, this optimum is always better than the initialization, and it tends to give results competitive with the other algorithms.
Fig. 11 shows the effective number of bits for all methods on two datasets, NUS-WIDE and ANNSIFT-1M, and should be compared with fig. 6 for NUS-WIDE and fig. 7 (top row, initialized from AGH) for ANNSIFT-1M. For each method, we show the effective number of bits for the training set (solid line) and the test set (dashed line). We also show a diagonal-horizontal line indicating the upper bound (again, one such line for each of the training and test sets). All methods lie below this line, and the closer they are to it, the better their code utilization.
As mentioned in section 5, tPCA consistently gives the largest effective number of bits, which by itself is not a reliable indicator of good precision/recall. However, there is a reasonably good correlation between the effective number of bits and precision (with a few exceptions, notably tPCA), in that the methods rank in a comparable order under both measures. This is especially true when comparing with the precision plots corresponding to retrieving neighbors within a given Hamming distance, and is particularly clear on the ANNSIFT-1M dataset.
Methods for which the precision drops at a high number of bits (when using a small Hamming distance to retrieve neighbors) also show the effective number of bits stagnating at those values. This is because the small number of points (and hence of possibly used codes) limits the achievable entropy. These methods are making good use of the available codes and have few collisions, so they need a larger Hamming distance to find neighbors. In particular, this explains the drop in precision of BA at a small Hamming distance in fig. 7, which we also noted earlier. If the training or test set were larger, we would expect the precision and the effective number of bits to continue to increase for those same numbers of bits.
The correlation between the effective number of bits and precision is also seen when comparing methods that use the same model and (approximately) optimize the same objective, such as the binary autoencoder reconstruction error for ITQ and MAC. The consistent improvement of MAC over ITQ in precision is seen in the effective number of bits too. This further suggests that the better optimization effected by MAC improves the hash function, both in precision and in code utilization.
We can see that if the number of points in the dataset (training or test) is small compared to the number of available codes, then the effective number of bits stagnates as the number of bits increases (since the entropy cannot exceed the logarithm of the number of points). This is most obvious with the test set of ANNSIFT-1M, since it is small. Together with the precision/recall curves, the effective number of bits can be used to determine the number of bits to use with a given database.
The leftmost plot of fig. 11 shows (for selected methods on NUS-WIDE) the actual code distribution as a histogram. That is, for each binary code that is used (i.e., that has at least one vector mapping to it), we plot the number of high-dimensional vectors that map to it. This is the code distribution of section 5, unnormalized and without the zero-count codes. The entropy of this distribution gives the effective number of bits in the middle plot. High entropy corresponds to a large number of used codes and to uniformity in their counts.
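The effective number of bits is simply the entropy of this empirical code distribution, and can be computed directly from the counts. A minimal sketch (the function name is ours):

```python
from collections import Counter
from math import log2

def effective_bits(codes):
    """Effective number of bits: the entropy (in bits) of the empirical
    distribution of used codes. It is at most min(L, log2(N)) for N points
    and L-bit codes, with equality when the counts are uniform."""
    counts = Counter(codes)
    n = len(codes)
    return -sum((c / n) * log2(c / n) for c in counts.values())
```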
One contribution of this paper is to reveal a connection between ITQ (Gong et al., 2013) (a popular, effective hashing algorithm) and binary autoencoders. ITQ can be seen as a fast, approximate optimization of the BA objective function, using a “filter” approach (relax the problem to obtain continuous codes, iteratively quantize the codes, then fit the hash function). Our BA algorithm is a corrected version of ITQ.
Admittedly, there are objective functions better suited to information retrieval than the autoencoder, which explicitly encourage distances in the original and Hamming spaces to match in order to preserve nearest neighbors (Weiss et al., 2009; Liu et al., 2011; Kulis and Grauman, 2012; Lin et al., 2013, 2014). However, autoencoders do result in good hash functions, as evidenced by the good performance of ITQ and of our method (or of semantic hashing (Salakhutdinov and Hinton, 2009), using neural nets). The reason is that, with continuous codes, autoencoders can capture the data manifold in a smooth way and indirectly preserve distances, encouraging (dis)similar images to have (dis)similar codes, even if this is degraded to some extent by the quantization introduced with discrete codes. Autoencoders are also faster and easier to optimize, and scale up better to large datasets.
Note that, although similar in some respects, the binary autoencoder is not a graphical model; in particular, it is not a stacked restricted Boltzmann machine (RBM). A binary autoencoder composes two deterministic mappings: one (the encoder) with binary outputs, and another (the decoder) with binary inputs. Hence, the objective function for binary autoencoders is discontinuous. The objective function for RBMs is differentiable, but it involves a normalization factor that is computationally intractable. This results in a very different optimization: a nonsmooth optimization for binary autoencoders (with combinatorial optimization in the code step if using MAC), and a smooth optimization using sampling and approximate gradients for RBMs. See Vincent et al. (2010, p. 3372).
Up to now, many hashing approaches have essentially ignored the binary nature of the problem, approximating it through relaxation and truncation and possibly disregarding the hash function when learning the binary codes. The inspiration for this work was to capitalize on the decoupling introduced by the method of auxiliary coordinates, in order to break the combinatorial complexity of optimizing with binary constraints and to introduce parallelism into the problem. Armed with this algorithm, we have shown that respecting the binary nature of the problem during the optimization is possible in an efficient way, and that it leads to better hash functions, competitive with the state of the art. This is particularly encouraging given that the autoencoder objective is not the best for retrieval, and that we focused on linear hash functions.
The algorithm has an intuitive form (alternating classification, regression and binarization steps) that can reuse existing, well-developed code. The extension to nonlinear hash and reconstruction mappings is straightforward and it will be interesting to see how much these can improve over the linear case. This paper is a step towards constructing better hash functions using the MAC framework. We believe it may apply more widely to other objective functions.
Work supported by NSF award IIS–1423515. We thank Ming-Hsuan Yang and Yi-Hsuan Tsai (UC Merced) for helpful discussions about binary hashing.
Consider the following quadratic optimization with binary variables:
where the quadratic term involves a symmetric matrix. In general, this is an NP-hard problem (Garey and Johnson, 1979). Beck and Teboulle (2000) gave global optimality conditions for this problem that are expressed simply in terms of the problem’s data, involving only primal variables and no dual variables:
Here, the conditions involve the smallest eigenvalue of the quadratic-term matrix, a vector of ones, and diagonal matrices built from the candidate solution and the linear-term vector. Intuitively, if the quadratic term is “smaller” than the linear one, then we can disregard the quadratic term and trivially solve the separable, linear problem.
Jeyakumar et al. (2007) further gave additional global optimality conditions:
A sufficient condition: if it holds at a binary vector, that vector is a global optimizer of (9).
A necessary condition: any global optimizer of (9) must satisfy it.
It is possible to give tighter conditions (Xia, 2009) but which are more complicated to compute.
Furthermore, Beck and Teboulle (2000) also gave conditions for the minimizer of the relaxed problem to be the global minimizer of the binary problem, where the continuous relaxation (10) is obtained by replacing the binary constraints with box constraints.
We also have a relation when the solution of the binary problem is “close enough” to that of the relaxed one: if an optimizer of the relaxed problem (10) satisfies the corresponding closeness condition, then the vector obtained by taking the sign of each of its components is a global optimizer of the binary problem (9).
In our case, we are particularly interested in the sufficient conditions, which we can combine into a single test.
Computationally, these conditions have a cost comparable to evaluating the objective function for a single binary vector (in the main paper). Moreover, some of the arithmetic operations are shared between evaluating the objective and checking the necessary and sufficient conditions for global optimality of the binary problem, so we can use the latter as a fast test to determine whether we have found the global minimizer and stop the search (when enumerating all vectors). Likewise, we can use the necessary and sufficient conditions to determine whether the solution to the relaxed problem is the global minimizer of the discrete problem, since in our case the quadratic function is convex. Also, note that computing the quantities involved in these conditions is done just once for all data points in the training set, since all the problems (i.e., finding the binary codes for each data point) share the same quadratic-term matrix.
Proc. of the 17th Int. Conf. Artificial Intelligence and Statistics (AISTATS 2014), pages 10–19, Reykjavik, Iceland, Apr. 22–25 2014.
J. Machine Learning Research, 9:1871–1874, Aug. 2008.
Proc. of the 19th European Symposium on Artificial Neural Networks (ESANN 2011), Bruges, Belgium, Apr. 27–29 2011.
Fast supervised hashing with decision trees for high-dimensional data. In Proc. of the 2014 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’14), pages 1971–1978, Columbus, OH, June 23–28 2014.
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Machine Learning Research, 11:3371–3408, 2010.