Learning-based binary hashing has become a powerful paradigm for fast search and retrieval in massive databases. However, due to the requirement of discrete outputs for the hash functions, learning such functions is known to be very challenging. In addition, the objective functions adopted by existing hashing techniques are mostly chosen heuristically. In this paper, we propose a novel generative approach to learn hash functions through the Minimum Description Length principle such that the learned hash codes maximally compress the dataset and can also be used to regenerate the inputs. We also develop an efficient learning algorithm based on the stochastic distributional gradient, which avoids the notorious difficulty caused by binary output constraints, to jointly optimize the parameters of the hash function and the associated generative model. Extensive experiments on a variety of large-scale datasets show that the proposed method achieves better retrieval results than the existing state-of-the-art methods.
Search for similar items in web-scale datasets is a fundamental step in a number of applications, especially in image and document retrieval. Formally, given a reference dataset $X = \{x_i\}_{i=1}^N$ with $x_i \in \mathcal{X} \subset \mathbb{R}^d$, we want to retrieve similar items from $X$ for a given query $y$ according to some similarity measure $\mathrm{sim}(x, y)$. When the negative Euclidean distance is used, i.e., $\mathrm{sim}(x, y) = -\|x - y\|_2$, this corresponds to the L2 Nearest Neighbor Search (L2NNS) problem; when the inner product is used, i.e., $\mathrm{sim}(x, y) = x^\top y$, it becomes a Maximum Inner Product Search (MIPS) problem. In this work, we focus on L2NNS for simplicity; however, our method handles MIPS problems as well, as shown in supplementary material D. Brute-force linear search is expensive for large datasets. To alleviate the time and storage bottlenecks, two research directions have been studied extensively: (1) partition the dataset so that only a subset of data points is searched; (2) represent the data as codes so that similarity computation can be carried out more efficiently. The former often resorts to search-tree or bucket-based lookup, while the latter relies on binary hashing or quantization. These two groups of techniques are orthogonal and are typically employed together in practice.
In this work, we focus on speeding up search via binary hashing. Hashing for similarity search was popularized by influential works such as Locality Sensitive Hashing (Indyk and Motwani, 1998; Gionis et al., 1999; Charikar, 2002). The crux of binary hashing is to utilize a hash function, $h(\cdot): \mathcal{X} \to \{0,1\}^l$, which maps the original samples in $\mathcal{X} \subset \mathbb{R}^d$ to $l$-bit binary vectors $h \in \{0,1\}^l$ while preserving the similarity measure, e.g., Euclidean distance or inner product. Search with such binary representations can be efficiently conducted using Hamming distance computation, which is supported via POPCNT on modern CPUs and GPUs. Quantization-based techniques (Babenko and Lempitsky, 2014; Jegou et al., 2011; Zhang et al., 2014b) have been shown to give stronger empirical results but tend to be less efficient than Hamming search over binary codes (Douze et al., 2015; He et al., 2013).
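As a minimal sketch of Hamming search over binary codes, the following NumPy snippet packs bits into bytes, XORs the query against the database, and counts differing bits with a byte-level lookup table. The lookup table stands in for the hardware POPCNT instruction mentioned above; the function names (`pack_codes`, `hamming_distances`, `hamming_search`) are illustrative and not taken from the paper's code.

```python
import numpy as np

# Lookup table: number of set bits in every possible byte value (software stand-in for POPCNT).
POPCOUNT_TABLE = np.array([bin(v).count("1") for v in range(256)], dtype=np.uint8)

def pack_codes(bits):
    """Pack an (n, l) array of 0/1 values into (n, ceil(l / 8)) uint8 bytes."""
    return np.packbits(np.asarray(bits, dtype=np.uint8), axis=1)

def hamming_distances(query_packed, db_packed):
    """Hamming distance from one packed query code to every packed database code."""
    xor = np.bitwise_xor(db_packed, query_packed)  # bytes whose set bits mark disagreements
    return POPCOUNT_TABLE[xor].sum(axis=1, dtype=np.int64)

def hamming_search(query_bits, db_bits, n):
    """Indices of the n database codes closest to the query in Hamming distance."""
    query_packed = pack_codes(query_bits[None, :])[0]
    dists = hamming_distances(query_packed, pack_codes(db_bits))
    return np.argsort(dists, kind="stable")[:n]
```

Packing pads the last byte with zeros for both query and database, so the padding never contributes to the distance. A production system would typically use 64-bit words and a native popcount rather than a Python-level table.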
Data-dependent hash functions are well-known to perform better than randomized ones (Wang et al., 2014). Learning hash functions or binary codes has been discussed in several papers, including spectral hashing (Weiss et al., 2009), semi-supervised hashing (Wang et al., 2010), iterative quantization (Gong and Lazebnik, 2011), and others (Liu et al., 2011; Gong et al., 2013; Yu et al., 2014; Shen et al., 2015; Guo et al., 2016). The main idea behind these works is to optimize some objective function that captures the preferred properties of the hash function in a supervised or unsupervised fashion.
Even though these methods have shown promising performance in several applications, they suffer from two main drawbacks: (1) the objective functions are often heuristically constructed without a principled characterization of goodness of hash codes, and (2) when optimizing, the binary constraints are crudely handled through some relaxation, leading to inferior results (Liu et al., 2014). In this work, we introduce Stochastic Generative Hashing (SGH) to address these two key issues. We propose a generative model which captures both the encoding of binary codes $h$ from input $x$ and the decoding of input $x$ from $h$. This provides a principled hash learning framework, where the hash function is learned by the Minimum Description Length (MDL) principle. Therefore, its generated codes can compress the dataset maximally. Such a generative model also enables us to optimize distributions over discrete hash codes without the necessity to handle discrete variables. Furthermore, we introduce a novel distributional stochastic gradient descent method which exploits distributional derivatives and generates higher quality hash codes. Prior work on binary autoencoders (Carreira-Perpinán and Raziperchikolaei, 2015) also takes a generative view of hashing but still uses relaxation of binary constraints when optimizing the parameters, leading to inferior performance as shown in the experiment section. We also show that binary autoencoders can be seen as a special case of our formulation. In this work, we mainly focus on the unsupervised setting (the proposed algorithm can be extended to supervised and semi-supervised settings easily, as described in supplementary material E).
We start by first formalizing the two key issues that motivate the development of the proposed algorithm.
Generative view. Given an input $x \in \mathbb{R}^d$, most hashing works in the literature emphasize modeling the forward process of generating binary codes from input, i.e., $h(x) \in \{0,1\}^l$, to ensure that the generated hash codes preserve the local neighborhood structure in the original space. Few works focus on modeling the reverse process of generating input from binary codes, so that the reconstructed input has small reconstruction error. In fact, the generative view provides a natural learning objective for hashing. Following this intuition, we model the process of generating $x$ from $h$, $p(x \mid h)$, and derive the corresponding hash function $q(h \mid x)$ from the generative process. Our approach is not tied to any specific choice of $p(x \mid h)$ but can adapt to any generative model appropriate for the domain. In this work, we show that even a simple generative model (Section 2.1) already achieves state-of-the-art performance.
Binary constraints. The other issue arises from dealing with binary constraints. One popular approach is to relax the constraints from $\{0,1\}$ (Weiss et al., 2009), but this often leads to a large optimality gap between the relaxed and non-relaxed objectives. Another approach is to enforce the model parameterization to have a particular structure so that when applying alternating optimization, the algorithm can alternate between updating the parameters and binarization efficiently. For example, (Gong and Lazebnik, 2011; Gong et al., 2012) imposed an orthogonality constraint on the projection matrix, while (Yu et al., 2014) proposed to use circulant constraints, and (Zhang et al., 2014a) introduced Kronecker product structure. Although such constraints alleviate the difficulty with optimization, they substantially reduce the model flexibility. In contrast, we avoid such constraints and propose to optimize distributions over the binary variables rather than the binary variables themselves, using the stochastic neuron reparametrization (Section 2.4), which allows us to back-propagate through the layers of weights using the stochastic gradient estimator.
Unlike (Carreira-Perpinán and Raziperchikolaei, 2015) which relies on solving expensive integer programs, our model is end-to-end trainable using distributional stochastic gradient descent (Section 3). Our algorithm requires no iterative steps unlike iterative quantization (ITQ) (Gong and Lazebnik, 2011). The training procedure is much more efficient with guaranteed convergence compared to alternating optimization for ITQ.
In the following sections, we first introduce the generative hashing model $p(x \mid h)$ in Section 2.1. Then, we describe the corresponding process of generating hash codes given input $x$, $q(h \mid x)$, in Section 2.2. Finally, we describe the training procedure based on the Minimum Description Length (MDL) principle and the stochastic neuron reparametrization in Sections 2.3 and 2.4. We also introduce the distributional stochastic gradient descent algorithm in Section 3.
Unlike most works which start with the hash function $h(x)$, we first introduce a generative model that defines the likelihood of generating input $x$ given its binary code $h$, i.e., $p(x \mid h)$. It is also referred to as a decoding function. The corresponding hash codes are derived from an encoding function $q(h \mid x)$, described in Section 2.2.
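We use a simple Gaussian distribution to model the generation of $x$ given $h$:

$$p(x, h) = p(x \mid h)\, p(h), \quad \text{where } p(x \mid h) = \mathcal{N}(Uh, \rho^2 I), \tag{1}$$

and $U = \{u_i\}_{i=1}^l$, $u_i \in \mathbb{R}^d$, is a codebook with $l$ codewords. The prior $p(h) \sim \mathcal{B}(\theta) = \prod_{i=1}^l \theta_i^{h_i} (1 - \theta_i)^{1 - h_i}$ is modeled as the multivariate Bernoulli distribution on the hash codes, where $\theta = [\theta_i]_{i=1}^l \in [0,1]^l$. Intuitively, this is an additive model which reconstructs $x$ by summing the selected columns of $U$ given $h$, with a Bernoulli prior on the distribution of hash codes. The joint distribution can be written as:

$$p(x, h) \propto \exp\Big(-\frac{1}{2\rho^2}\big(x^\top x + h^\top U^\top U h - 2 x^\top U h\big) + \Big(\log\frac{\theta}{1-\theta}\Big)^\top h\Big). \tag{2}$$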
This generative model can be seen as a restricted form of general Markov Random Fields in the sense that the parameters for modeling correlation between latent variables $h$ and correlation between $x$ and $h$ are shared. However, it is more flexible compared to Gaussian Restricted Boltzmann machines (Krizhevsky, 2009; Ranzato and Hinton, 2010) due to an extra quadratic term for modeling correlation between latent variables. We first show that this generative model preserves the local neighborhood structure of $x$ when the Frobenius norm of $U$ is bounded.

If $\|U\|_F$ is bounded, then the Gaussian reconstruction error, $\|x - U h_x\|_2$, is a surrogate for Euclidean neighborhood preservation. Given two points $x, y \in \mathcal{X}$, their Euclidean distance is bounded by

$$\|x - y\|_2 = \|(x - U h_x) - (y - U h_y) + (U h_x - U h_y)\|_2 \le \|x - U h_x\|_2 + \|y - U h_y\|_2 + \|U\|_F \|h_x - h_y\|_2,$$

where $h_x$ and $h_y$ denote the binary latent variables corresponding to $x$ and $y$, respectively. Therefore, we have

$$\|x - y\|_2 - \|U\|_F \|h_x - h_y\|_2 \le \|x - U h_x\|_2 + \|y - U h_y\|_2,$$

which means minimizing the Gaussian reconstruction error, i.e., $-\log p(x \mid h)$, will lead to Euclidean neighborhood preservation. A similar argument can be made with respect to MIPS neighborhood preservation as shown in supplementary material D. Note that the choice of $p(x \mid h)$ is not unique, and any generative model that leads to neighborhood preservation can be used here. In fact, one can even use more sophisticated models with multiple layers and nonlinear functions. In our experiments, we find complex generative models tend to perform similarly to the Gaussian model on datasets such as SIFT-1M and GIST-1M. Therefore, we use the Gaussian model for simplicity.
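The neighborhood-preservation bound can be checked numerically. The sketch below evaluates both sides of $\|x - y\|_2 \le \|x - Uh_x\|_2 + \|y - Uh_y\|_2 + \|U\|_F \|h_x - h_y\|_2$ for a randomly drawn codebook and random codes; the toy dimensions and the helper name `neighborhood_bound` are assumptions made for illustration only.

```python
import numpy as np

def neighborhood_bound(x, y, h_x, h_y, U):
    """Return both sides of ||x - y|| <= ||x - U h_x|| + ||y - U h_y|| + ||U||_F ||h_x - h_y||."""
    lhs = np.linalg.norm(x - y)
    rhs = (np.linalg.norm(x - U @ h_x)
           + np.linalg.norm(y - U @ h_y)
           + np.linalg.norm(U, "fro") * np.linalg.norm(h_x - h_y))
    return lhs, rhs

# Hypothetical toy setup: d = 16 input dimensions, l = 8 bits, random codebook and codes.
rng = np.random.default_rng(0)
d, l = 16, 8
U = rng.normal(size=(d, l))
x, y = rng.normal(size=d), rng.normal(size=d)
h_x = rng.integers(0, 2, size=l).astype(float)
h_y = rng.integers(0, 2, size=l).astype(float)
lhs, rhs = neighborhood_bound(x, y, h_x, h_y, U)
```

The inequality holds for any codes, not only learned ones; training matters because it shrinks the two reconstruction terms, which tightens the link between Hamming distance and Euclidean distance.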
Even with the simple Gaussian model (1), computing the posterior $p(h \mid x) = \frac{p(x, h)}{p(x)}$ is not tractable, and finding the MAP solution of the posterior involves solving an expensive integer programming subproblem. Inspired by the recent work on variational auto-encoders (Kingma and Welling, 2013; Mnih and Gregor, 2014; Gregor et al., 2014), we propose to bypass these difficulties by parameterizing the encoding function as

$$q(h \mid x) = \prod_{k=1}^l q(h_k = 1 \mid x)^{h_k}\, q(h_k = 0 \mid x)^{1 - h_k}, \tag{3}$$

to approximate the exact posterior $p(h \mid x)$. With the linear parametrization, $h = [h_k]_{k=1}^l \sim \mathcal{B}(\sigma(W^\top x))$ with $W = [w_k]_{k=1}^l$. At the training step, a hash code is obtained by sampling from $\mathcal{B}(\sigma(W^\top x))$. At the inference step, it is still possible to sample $h$. More directly, the MAP solution of the encoding function (3) is readily given by

$$h(x) = \arg\max_h q(h \mid x) = \frac{\mathrm{sign}(W^\top x) + 1}{2}.$$

This involves only a linear projection followed by a sign operation, which is common in the hashing literature. Computing $h(x)$ in our model thus has the same amount of computation as ITQ (Gong and Lazebnik, 2011), except without the orthogonality constraints.
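A minimal sketch of the two ways of producing a code from the linear encoder: the MAP code used at inference and a Bernoulli sample used during training. Here `W` is assumed to be stored as a `(d, l)` matrix and inputs as rows of `X`; the function names are illustrative, and ties at exactly zero are mapped to bit 0, which is an arbitrary choice.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode_map(X, W):
    """MAP hash codes h = (sign(W^T x) + 1) / 2 for each row x of X (shape (n, d))."""
    return (X @ W > 0).astype(np.uint8)

def encode_sample(X, W, rng):
    """Sampled hash codes h ~ Bernoulli(sigmoid(W^T x)), as used at training time."""
    probs = sigmoid(X @ W)
    return (rng.random(probs.shape) < probs).astype(np.uint8)
```

For example, `encode_map(X, W)` on a database matrix `X` costs one matrix product and one comparison, which matches the linear-projection-plus-sign cost described above.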
Since our goal is to reconstruct $x$ using the least information in binary codes, we train the variational auto-encoder using the Minimum Description Length (MDL) principle, which finds the best parameters that maximally compress the training data. The MDL principle seeks to minimize the expected amount of information to communicate $x$:

$$L(x) = \sum_h q(h \mid x)\big(L(h) + L(x \mid h)\big),$$

where $L(h) = -\log p(h) + \log q(h \mid x)$ is the description length of the hashed representation $h$ and $L(x \mid h) = -\log p(x \mid h)$ is the description length of $x$ having already communicated $h$ (Hinton and Van Camp, 1993; Hinton and Zemel, 1994; Mnih and Gregor, 2014). By summing over all training examples $x$, we obtain the following training objective, which we wish to minimize with respect to the parameters of $p(x \mid h)$ and $q(h \mid x)$:

$$\min_{\Theta = \{W, U, \beta, \rho\}} H(\Theta) := \sum_x L(x; \Theta) = -\sum_x \sum_h q(h \mid x)\big(\log p(x, h) - \log q(h \mid x)\big), \tag{4}$$

where $U$, $\rho$ and $\beta := \log\frac{\theta}{1 - \theta}$ are parameters of the generative model $p(x, h)$ as defined in (1), and $W$ comes from the encoding function $q(h \mid x)$ defined in (3). This objective is sometimes called the Helmholtz (variational) free energy (Williams, 1980; Zellner, 1988; Dai et al., 2016). When the true posterior $p(h \mid x)$ falls into the family of (3), $q(h \mid x)$ becomes the true posterior $p(h \mid x)$, which leads to the shortest description length to represent $x$.
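For a handful of bits, the description length of a single input can be computed exactly by enumerating all $2^l$ codes. The sketch below does this for the Gaussian decoder and factorized Bernoulli encoder, and also computes $-\log p(x)$ so that the gap between the two (a KL divergence, hence non-negative) can be inspected. All function names are hypothetical, and enumeration is only feasible for small $l$.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def neg_log_joint(x, h, U, theta, rho):
    """-log p(x, h) for p(x | h) = N(U h, rho^2 I) and p(h) = Bernoulli(theta)."""
    d = x.shape[0]
    recon = np.sum((x - U @ h) ** 2) / (2.0 * rho ** 2) + 0.5 * d * np.log(2.0 * np.pi * rho ** 2)
    prior = -np.sum(h * np.log(theta) + (1 - h) * np.log(1 - theta))
    return recon + prior

def log_q(x, h, W):
    """log q(h | x) for the factorized encoder q(h_k = 1 | x) = sigmoid(w_k^T x)."""
    p = sigmoid(W.T @ x)
    return np.sum(h * np.log(p) + (1 - h) * np.log(1 - p))

def description_length(x, U, W, theta, rho):
    """Exact L(x) = sum_h q(h | x) [-log p(x, h) + log q(h | x)] over all 2^l codes."""
    total = 0.0
    for bits in itertools.product([0.0, 1.0], repeat=U.shape[1]):
        h = np.array(bits)
        lq = log_q(x, h, W)
        total += np.exp(lq) * (neg_log_joint(x, h, U, theta, rho) + lq)
    return total

def neg_log_marginal(x, U, theta, rho):
    """-log p(x) = -log sum_h p(x, h), also by enumeration."""
    terms = [-neg_log_joint(x, np.array(b), U, theta, rho)
             for b in itertools.product([0.0, 1.0], repeat=U.shape[1])]
    return -np.logaddexp.reduce(terms)
```

The difference `description_length(...) - neg_log_marginal(...)` is zero only when the encoder matches the true posterior, which is the sense in which the objective rewards a good encoder.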
We emphasize that this objective no longer includes binary variables $h$ as parameters and therefore avoids optimizing with discrete variables directly. This paves the way for continuous optimization methods such as stochastic gradient descent (SGD) to be applied in training. As far as we are aware, this is the first time such a procedure has been used in the problem of unsupervised learning to hash. Our methodology serves as a viable alternative to the relaxation-based approaches commonly used in the past.
Using the training objective of (4), we can directly compute the gradients w.r.t. the parameters of $p(x \mid h)$. However, we cannot compute the stochastic gradients w.r.t. $W$ because it depends on the stochastic binary variables $h$. In order to back-propagate through the stochastic nodes of $h$, two possible solutions have been proposed. The first is the reparametrization trick (Kingma and Welling, 2013), which works by introducing auxiliary noise variables in the model. However, it is difficult to apply when the stochastic variables are discrete, as is the case for $h$ in our model. On the other hand, the gradient estimators based on the REINFORCE trick (Bengio et al., 2013) suffer from high variance. Although some variance reduction remedies have been proposed (Mnih and Gregor, 2014; Gu et al., 2015), they are either biased or require complicated extra computation in practice.
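In the next section, we first provide an unbiased estimator of the gradient w.r.t. $W$ derived based on the distributional derivative, and then we derive a simple and efficient approximator. Before we derive the estimator, we first introduce the stochastic neuron for reparametrizing the Bernoulli distribution. A stochastic neuron reparameterizes each Bernoulli variable $h_k(z)$ with $z \in (0, 1)$. Introducing random variables $\xi \sim \mathcal{U}(0, 1)$, the stochastic neuron is defined as

$$\tilde{h}(z, \xi) := \begin{cases} 1 & \text{if } z \ge \xi, \\ 0 & \text{if } z < \xi. \end{cases}$$

Because $P(\tilde{h}(z, \xi) = 1) = z$, we have $\tilde{h}(z, \xi) \sim \mathcal{B}(z)$. We use the stochastic neuron to reparameterize our binary variables $h$ by replacing $[h_k]_{k=1}^l(x) \sim \mathcal{B}(\sigma(w_k^\top x))$ with $[\tilde{h}_k(\sigma(w_k^\top x), \xi_k)]_{k=1}^l$. Note that $\tilde{h}$ now behaves deterministically given $\xi$. This gives us the reparameterized version of our original training objective (4):

$$\tilde{H}(\Theta) = \sum_x \tilde{H}(\Theta; x) := \sum_x \mathbb{E}_\xi\big[\ell(\tilde{h}, x)\big],$$

where $\ell(\tilde{h}, x) := -\log p(x, \tilde{h}(\sigma(W^\top x), \xi)) + \log q(\tilde{h}(\sigma(W^\top x), \xi) \mid x)$ with $\xi \sim \mathcal{U}(0, 1)$. With such a reformulation, the new objective can now be optimized by exploiting the distributional stochastic gradient descent, which will be explained in the next section.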
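The stochastic neuron is a one-line thresholding operation. The sketch below implements it and checks empirically that thresholding $z$ against uniform noise yields a Bernoulli variable with mean $z$; the sample size and the specific values of `z` are arbitrary choices for illustration.

```python
import numpy as np

def stochastic_neuron(z, xi):
    """h~(z, xi) = 1 if z >= xi else 0, elementwise; deterministic once xi is fixed."""
    return (z >= xi).astype(np.float64)

rng = np.random.default_rng(0)
z = np.array([0.1, 0.5, 0.9])
xi = rng.uniform(size=(200_000, 3))          # xi ~ U(0, 1), one independent draw per bit
empirical_mean = stochastic_neuron(z, xi).mean(axis=0)  # should be close to z
```

All randomness lives in `xi`, so for a fixed draw the code is a deterministic (though discontinuous) function of `z`, which is what the distributional derivative in the next section is applied to.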
For the reparameterized objective $\tilde{H}(\Theta)$, given a point $x$ randomly sampled from $\{x_i\}_{i=1}^N$, the stochastic gradient $\hat{\nabla}_{U, \beta, \rho} \tilde{H}(\Theta; x)$ can be easily computed in the standard way. However, with the reparameterization, the function $\tilde{H}(\Theta; x)$ is no longer differentiable with respect to $W$ due to the discontinuity of the stochastic neuron $\tilde{h}(z, \xi)$. Namely, the SGD algorithm is not readily applicable. To overcome this difficulty, we adopt the notion of the distributional derivative for generalized functions or distributions (Grubb, 2008).

Let $\Omega \subset \mathbb{R}^d$ be an open set. Denote by $C_0^\infty(\Omega)$ the space of functions that are infinitely differentiable with compact support in $\Omega$. Let $\mathcal{D}'(\Omega)$ be the space of continuous linear functionals on $C_0^\infty(\Omega)$, which can be considered as the dual space. The elements of $\mathcal{D}'(\Omega)$ are often called general distributions. We emphasize that this definition of distributions is more general than that of traditional probability distributions.
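Definition (Distributional derivative) (Grubb, 2008). Let $u \in \mathcal{D}'(\Omega)$; then a distribution $v$ is called the distributional derivative of $u$, denoted as $v = Du$, if it satisfies

$$\int_\Omega v\, \phi\, dx = -\int_\Omega u\, \partial\phi\, dx, \quad \forall \phi \in C_0^\infty(\Omega).$$

It is straightforward to verify that for given $\xi$, the function $\tilde{h}(z, \xi) \in \mathcal{D}'(\Omega)$ and moreover, $D_z \tilde{h}(z, \xi) = \delta_\xi(z)$, which is exactly the Dirac-$\delta$ function. Based on the definition of distributional derivatives and chain rules, we are able to compute the distributional derivative of the function $\tilde{H}(\Theta; x)$, which is provided in the following lemma.

Lemma 3.1. For a given sample $x$, the distributional derivative of the function $\tilde{H}(\Theta; x)$ w.r.t. $W$ is given by

$$D_W \tilde{H}(\Theta; x) = \mathbb{E}_\xi\big[\Delta_{\tilde{h}} \ell(\tilde{h}(\sigma(W^\top x), \xi))\, \sigma(W^\top x) \bullet (1 - \sigma(W^\top x))\, x^\top\big],$$

where $\bullet$ denotes the point-wise product and $\Delta_{\tilde{h}} \ell(\tilde{h})$ denotes the finite difference defined as $[\Delta_{\tilde{h}} \ell(\tilde{h})]_k = \ell(\tilde{h}_k^1) - \ell(\tilde{h}_k^0)$, where $[\tilde{h}_k^i]_j = \tilde{h}_j$ if $j \ne k$, otherwise $[\tilde{h}_k^i]_j = i$, $i \in \{0, 1\}$. We can therefore combine the distributional derivative estimator of Lemma 3.1 with the stochastic gradient descent algorithm (see e.g., (Nemirovski et al., 2009) and its variants (Kingma and Ba, 2014; Bottou et al., 2016)), which we designate as Distributional SGD. The detail is presented in Algorithm 1, where we denote

$$\hat{D}_\Theta \tilde{H}(\Theta; x) = \big[\hat{D}_W \tilde{H}(\Theta; x),\ \hat{\nabla}_{U, \beta, \rho} \tilde{H}(\Theta; x)\big]$$

as the unbiased stochastic estimator of the gradient at $\Theta$ constructed by sample $(x, \xi)$. Compared to the existing algorithms for learning to hash which require substantial effort on optimizing over binary variables, the proposed distributional SGD is much simpler and also amenable to online settings (Huang et al., 2013; Leng et al., 2015).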
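The finite-difference estimator of Lemma 3.1 can be sketched for a single sample as follows, assuming the Gaussian decoder and linear Bernoulli encoder described earlier. The result is laid out with the same `(d, l)` shape as `W` (the transpose of the $x^\top$ layout in the lemma), the Gaussian normalizing constant is dropped because it does not depend on $h$ or $W$, and all function names are hypothetical.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(h, x, U, W, theta, rho):
    """l(h, x) = -log p(x, h) + log q(h | x), up to a constant independent of h and W."""
    z = sigmoid(W.T @ x)
    recon = np.sum((x - U @ h) ** 2) / (2.0 * rho ** 2)
    prior = -np.sum(h * np.log(theta) + (1 - h) * np.log(1 - theta))
    log_q = np.sum(h * np.log(z) + (1 - h) * np.log(1 - z))
    return recon + prior + log_q

def finite_difference(h, x, U, W, theta, rho):
    """[Delta l]_k = l(h with bit k set to 1) - l(h with bit k set to 0)."""
    delta = np.empty_like(h)
    for k in range(h.shape[0]):
        h1, h0 = h.copy(), h.copy()
        h1[k], h0[k] = 1.0, 0.0
        delta[k] = loss(h1, x, U, W, theta, rho) - loss(h0, x, U, W, theta, rho)
    return delta

def distributional_grad_W(x, xi, U, W, theta, rho):
    """Single-sample estimate of the derivative w.r.t. W for noise draw xi, shape (d, l)."""
    z = sigmoid(W.T @ x)
    h = (z >= xi).astype(np.float64)          # stochastic neuron
    delta = finite_difference(h, x, U, W, theta, rho)
    return np.outer(x, delta * z * (1.0 - z))
```

Each call evaluates the loss twice per bit, which is the "two forward passes per dimension" cost noted in the text. Averaging the estimator over codes weighted by $q(h \mid x)$ reproduces the ordinary gradient of the exact objective, which can be confirmed numerically for small $l$.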
In general, the distributional derivative estimator of Lemma 3.1 requires two forward passes of the model for each dimension. To further accelerate the computation, we approximate the distributional derivative $D_W \tilde{H}(\Theta; x)$ by exploiting the mean value theorem and Taylor expansion by

$$\tilde{D}_W \tilde{H}(\Theta; x) := \mathbb{E}_\xi\big[\nabla_{\tilde{h}} \ell(\tilde{h}(\sigma(W^\top x), \xi))\, \sigma(W^\top x) \bullet (1 - \sigma(W^\top x))\, x^\top\big],$$

which can be computed for each dimension in one pass. Then, we can exploit this estimator

$$\tilde{D}_\Theta \tilde{H}(\Theta; x) = \big[\tilde{D}_W \tilde{H}(\Theta; x),\ \hat{\nabla}_{U, \beta, \rho} \tilde{H}(\Theta; x)\big]$$

in Algorithm 1. Interestingly, the approximate stochastic gradient estimator of the stochastic neuron we established through the distributional derivative coincides with the heuristic "pseudo-gradient" constructed by Raiko et al. (2014). Please refer to supplementary material A for the derivation of the approximate gradient estimator.
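A sketch of the approximate estimator under the same Gaussian decoder and linear encoder assumptions: the per-bit finite difference is replaced by the gradient of the loss with respect to $\tilde{h}$, treated as a continuous vector, so a single pass covers all bits. The closed-form gradient below follows from differentiating $\ell$ and using $\log\frac{\sigma(a)}{1 - \sigma(a)} = a$; the function names and the `(d, l)` output layout are illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss_grad_h(h, x, U, W, theta, rho):
    """Gradient of l(h, x) = -log p(x, h) + log q(h | x) w.r.t. h, with h treated as continuous."""
    beta = np.log(theta) - np.log(1.0 - theta)        # prior log-odds
    # Reconstruction term, prior term, and encoder log-odds (which equal W^T x).
    return -U.T @ (x - U @ h) / rho ** 2 - beta + W.T @ x

def approx_grad_W(x, xi, U, W, theta, rho):
    """Single-sample approximate (biased) derivative w.r.t. W for noise draw xi, shape (d, l)."""
    z = sigmoid(W.T @ x)
    h = (z >= xi).astype(np.float64)                   # stochastic neuron
    return np.outer(x, loss_grad_h(h, x, U, W, theta, rho) * z * (1.0 - z))
```

Compared with the finite-difference version, this costs one matrix-vector product per sample regardless of the number of bits, at the price of a bias that depends on how far $\ell$ is from linear in each bit.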
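One caveat here is that, due to the potential discrepancy between the distributional derivative and the traditional gradient, whether the distributional derivative is still a descent direction and whether the SGD algorithm integrated with the distributional derivative converges remain unclear in general. However, for our learning to hash problem, one can easily show that the distributional derivative in Lemma 3.1 is indeed the true gradient.

Proposition. The distributional derivative $D_W \tilde{H}(\Theta)$ is equivalent to the traditional gradient $\nabla_W H(\Theta)$.

Proof. First of all, by definition, we have $\tilde{H}(\Theta) = H(\Theta)$. One can easily verify that under mild conditions, both $D_W \tilde{H}(\Theta)$ and $\nabla_W H(\Theta)$ are continuous and 1-norm bounded. Hence, it suffices to show that for any distribution $u \in C^1(\Omega)$ with $Du, \nabla u \in L^1_{loc}(\Omega)$, $Du = \nabla u$. For any $\phi \in C_0^\infty(\Omega)$, by definition of the distributional derivative, we have $\int_\Omega Du\, \phi\, dx = -\int_\Omega u\, \partial\phi\, dx$. On the other hand, we always have $\int_\Omega \nabla u\, \phi\, dx = -\int_\Omega u\, \partial\phi\, dx$. Hence, $\int_\Omega (Du - \nabla u)\, \phi\, dx = 0$ for all $\phi \in C_0^\infty(\Omega)$. By the Du Bois-Reymond lemma (see Lemma 3.2 in (Grubb, 2008)), we have $Du = \nabla u$.

Consequently, the distributional SGD algorithm enjoys the same convergence property as the traditional SGD algorithm. Applying Theorem 2.1 in (Ghadimi and Lan, 2013), we arrive at the following result.

Theorem. Under the assumption that $H$ is $L$-Lipschitz smooth and the variance of the stochastic distributional gradient is bounded by $\sigma^2$ in the distributional SGD, for the solution $\Theta_R$ sampled from the trajectory $\{\Theta_i\}_{i=1}^t$ with probability $P(R = i) = \frac{2\gamma_i - L\gamma_i^2}{\sum_{i=1}^t (2\gamma_i - L\gamma_i^2)}$, where $\gamma_i \sim O(1/\sqrt{t})$, we have

$$\mathbb{E}\big[\|\nabla_\Theta \tilde{H}(\Theta_R)\|^2\big] \sim O\Big(\frac{1}{\sqrt{t}}\Big).$$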
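In fact, with the approximate gradient estimator, the proposed algorithm also converges in terms of first-order conditions, as stated next.

Theorem. Under the assumption that the variance of the approximate stochastic distributional gradient is bounded by $\sigma^2$, for the solution $\Theta_R$ sampled from the trajectory $\{\Theta_i\}_{i=1}^t$ with probability $P(R = i) = \frac{\gamma_i}{\sum_{i=1}^t \gamma_i}$, where $\gamma_i \sim O(1/\sqrt{t})$, we have

$$\mathbb{E}\big[(\Theta_R - \Theta^*)^\top \tilde{\nabla}_\Theta \tilde{H}(\Theta_R)\big] \sim O\Big(\frac{1}{\sqrt{t}}\Big),$$

where $\Theta^*$ denotes the optimal solution.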
The proposed stochastic generative hashing is a general framework. In this section, we reveal the connection to several existing algorithms.
Iterative Quantization (ITQ). If we fix some $\rho > 0$, and $U = WR$, where $W$ is formed by eigenvectors of the covariance matrix and $R$ is an orthogonal matrix, we have $U^\top U = I$. If we assume the joint distribution as

$$p(x, h) \propto \mathcal{N}(WRh, \rho^2 I)\, \mathcal{B}(\theta),$$

and parametrize $q(h \mid x_i) = \delta_{b_i}(h)$, then from the objective in (4) and ignoring the irrelevant terms, we obtain the optimization

$$\min_{R, b} \sum_{i=1}^N \|x_i - WRb_i\|^2,$$

which is exactly the objective of iterative quantization (Gong and Lazebnik, 2011).
Binary Autoencoder (BA). If we use the deterministic linear encoding function, i.e., $q(h \mid x) = \delta_{\frac{1 + \mathrm{sign}(W^\top x)}{2}}(h)$, fix some $\rho > 0$ in advance, and ignore the irrelevant terms, the optimization (4) reduces to

$$\min_{U, W} \sum_{i=1}^N \|x_i - U h_i\|^2, \quad \text{s.t. } h_i = \frac{1 + \mathrm{sign}(W^\top x_i)}{2},$$

which is the objective of a binary autoencoder (Carreira-Perpinán and Raziperchikolaei, 2015).

In BA, the encoding procedure is deterministic; therefore, the entropy term $\mathbb{E}_{q(h \mid x)}[\log q(h \mid x)] = 0$. In fact, the entropy term, if non-zero, acts as a regularizer and helps to avoid wasting bits. Moreover, without the stochasticity, the BA optimization above becomes extremely difficult due to the binary constraints. In the proposed algorithm, by contrast, we exploit the stochasticity to bypass this difficulty in optimization. The stochasticity also enables us to accelerate the optimization, as shown in Section 5.2.
In this section, we evaluate the performance of the proposed distributional SGD on commonly used datasets in hashing. For efficiency, we conduct the experiments mainly with the approximate gradient estimator. We evaluate the model and algorithm from several aspects to demonstrate the power of the proposed SGH: (1) Reconstruction loss. To demonstrate the flexibility of generative modeling, we compare the reconstruction error to that of ITQ (Gong and Lazebnik, 2011), showing the benefits of modeling without the orthogonality constraints. (2) Convergence of the distributional SGD. We evaluate the reconstruction error showing that the proposed algorithm indeed converges, verifying the theorems. (3) Training time. The existing generative works require a significant amount of time for training the model. In contrast, our SGD algorithm is very fast to train both in terms of number of examples needed and the wall time. (4) Nearest neighbor retrieval. We show Recall K@N plots on the standard large-scale nearest neighbor search benchmark datasets MNIST, SIFT-1M, GIST-1M and SIFT-1B, for all of which we achieve state-of-the-art results among binary hashing methods. (5) Reconstruction visualization. Due to the generative nature of our model, we can regenerate the original input with very few bits. On MNIST and CIFAR-10, we qualitatively illustrate the templates that correspond to each bit and the resulting reconstruction.
We used several benchmark datasets, i.e., (1) MNIST, which contains 60,000 digit images of size $28 \times 28$ pixels, (2) CIFAR-10, which contains 60,000 $32 \times 32$ pixel color images in 10 classes, (3) SIFT-1M and (4) SIFT-1B, which contain $10^6$ and $10^9$ samples, each of which is a 128-dimensional vector, and (5) GIST-1M, which contains $10^6$ samples, each of which is a 960-dimensional vector.
Figure 1: (a) Reconstruction error and (b) training time on SIFT-1M.
Table 1: Training time comparison (columns: Method, 8 bits, 16 bits, 32 bits, 64 bits).
Because our method has a generative model $p(x \mid h)$, we can easily compute the regenerated input $\tilde{x} = \arg\max_x p(x \mid h)$, and then compute the loss between the regenerated input and the original $x$, i.e., $\|x - \tilde{x}\|_2^2$. ITQ also trains by minimizing the binary quantization loss, as described in Equation (2) in (Gong and Lazebnik, 2011), which is essentially reconstruction loss when the magnitude of the feature vectors is compatible with the radius of the binary cube. We plotted the reconstruction loss of our method and ITQ on SIFT-1M in Figure 1(a) and on MNIST and GIST-1M in Figure 4, where the x-axis indicates the number of examples seen by the training algorithm and the y-axis shows the average reconstruction loss. The training time comparison is listed in Table 1. Our method (SGH) arrives at a better reconstruction loss with comparable or even less time compared to ITQ. The lower reconstruction loss supports our claim that the flexibility of the proposed model afforded by removing the orthogonality constraints indeed brings extra modeling ability. Note that ITQ is generally regarded as a technique with fast training among the existing binary hashing algorithms, and most other algorithms (He et al., 2013; Heo et al., 2012; Carreira-Perpinán and Raziperchikolaei, 2015) take much more time to train.
We demonstrate the convergence of the distributional derivative with Adam (Kingma and Ba, 2014) numerically on SIFT-1M, GIST-1M and MNIST from 8 bits to 64 bits. The convergence curves on SIFT-1M are shown in Figure 1(a). The results on GIST-1M and MNIST are similar and shown in Figure 4 in supplementary material C. The proposed algorithm, even with a biased gradient estimator, converges quickly regardless of how many bits are used. As expected, with more bits the model fits the data better and the reconstruction error can be reduced further.
In line with expectations, our distributional SGD trains much faster since it bypasses integer programming. We benchmark the actual time taken to train our method to convergence and compare that to binary autoencoder hashing (BA) (Carreira-Perpinán and Raziperchikolaei, 2015) on SIFT-1M, GIST-1M and MNIST. We illustrate the performance on SIFT-1M in Figure 1(b). The results on the GIST-1M and MNIST datasets follow a similar trend, as shown in supplementary material C. Empirically, BA takes significantly more time to train on all bit settings due to the expensive cost of solving the integer programming subproblem. Our experiments were run on AMD 2.4GHz Opteron CPUs and 32G memory. Our implementation of the stochastic neuron as well as the whole training procedure was done in TensorFlow. We have released our code on GitHub (https://github.com/doubling/Stochastic_Generative_Hashing). For the competing methods, we directly used the code released by the authors.
We compared the stochastic generative hashing on an L2NNS task with several state-of-the-art unsupervised algorithms, including $K$-means hashing (KMH) (He et al., 2013), iterative quantization (ITQ) (Gong and Lazebnik, 2011), spectral hashing (SH) (Weiss et al., 2009), spherical hashing (SpH) (Heo et al., 2012), binary autoencoder (BA) (Carreira-Perpinán and Raziperchikolaei, 2015), and scalable graph hashing (GH) (Jiang and Li, 2015). We demonstrate the performance of our binary codes with standard Approximate Nearest Neighbor (ANN) search benchmark experiments, comparing the retrieval recall. In particular, we compare with other unsupervised techniques that also generate binary codes. For each query, linear search in Hamming space is conducted to find the approximate neighbors.
Following the experimental setting of (He et al., 2013), we plot the Recall10@N curve for the MNIST, SIFT-1M, GIST-1M, and SIFT-1B datasets under varying numbers of bits (16, 32 and 64) in Figure 2. On the SIFT-1B dataset, we only compared with ITQ since the training cost of the other competitors is prohibitive. The recall is defined as the fraction of retrieved true nearest neighbors to the total number of true nearest neighbors. The Recall10@N is the recall of 10 ground truth neighbors in the N retrieved samples. Note that Recall10@N is generally a more challenging criterion than Recall@N (which is essentially Recall1@N), and better characterizes the retrieval results. For completeness, results of various Recall K@N curves can be found in the supplementary material; they show a similar trend to the Recall10@N curves.
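The Recall K@N metric described above can be computed as in the following sketch, which assumes each query comes with a ranked list of true nearest-neighbor ids and a ranked list of retrieved ids. The function name and argument layout are illustrative.

```python
import numpy as np

def recall_k_at_n(true_neighbors, retrieved, k, n):
    """Average fraction of each query's k true nearest neighbors found among its first n retrieved items."""
    hits = [len(set(t[:k]) & set(r[:n])) / k for t, r in zip(true_neighbors, retrieved)]
    return float(np.mean(hits))
```

With `k=10` this gives Recall10@N, and with `k=1` it reduces to the usual Recall@N; plotting the value against increasing `n` yields curves of the kind reported in Figure 2.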
Figure 2 shows that the proposed SGH consistently performs the best across all bit settings and all datasets. The searching time is the same for the same number of bits, because all algorithms use the same optimized implementation of POPCNT-based Hamming distance computation and priority queue. We point out that many of the baselines need significant parameter tuning for each experiment to achieve a reasonable recall, except for ITQ and our method, where we fix the hyperparameters (batch size and learning rate with stepsize decay) for all our experiments. Our method is less sensitive to hyperparameters.
One important aspect of utilizing a generative model for a hash function is that one can generate the input from its hash code. When the inputs are images, this corresponds to image generation, which allows us to visually inspect what the hash bits encode, as well as the differences in the original and generated images.
In our experiments on MNIST and CIFAR-10, we first visualize the "template" which corresponds to each hash bit, i.e., each column of the decoding dictionary $U$. This gives an interesting insight into what each hash bit represents. Unlike PCA components, where the top few look like averaged images and the rest are high-frequency noise, each of our image templates encodes distinct information and looks much like the filter banks of convolutional neural networks. Empirically, each template also looks quite different and encodes somewhat meaningful information, indicating that no bits are wasted or duplicated. Note that we obtain this representation as a by-product, without explicitly setting up the model with supervised information, similar to the case in convolutional neural nets.
We also compare the reconstruction ability of SGH with that of ITQ and real-valued PCA in Figure 3. For ITQ and SGH, we use a 64-bit hash code. For PCA, we kept 64 components, which amounts to $64 \times 32 = 2048$ bits. Compared visually with SGH, ITQ-reconstructed images look much less recognizable on MNIST and much more blurry on CIFAR-10. Compared to PCA, SGH achieves similar visual quality while using significantly fewer ($32\times$ fewer) bits.
In this paper, we have proposed a novel generative approach to learn binary hash functions. We have justified from a theoretical angle that the proposed algorithm is able to provide a good hash function that preserves Euclidean neighborhoods, while achieving fast learning and retrieval. Extensive experimental results justify the flexibility of our model, especially in reconstructing the input from the hash codes. Comparisons with approximate nearest neighbor search over several benchmarks demonstrate the advantage of the proposed algorithm empirically. We emphasize that the proposed generative hashing is a general framework which can be extended to semi-supervised settings and other learning to hash scenarios as detailed in the supplementary material. Moreover, the proposed distributional SGD with the unbiased gradient estimator and its approximator can be applied to general integer programming problems, which may be of independent interest.
LS is supported in part by NSF IIS-1218749, NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF IIS-1639792 EAGER, ONR N00014-15-1-2340, NVIDIA, Intel and Amazon AWS.
Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 380–388, 2002.
Provable bayesian inference via particle mirror descent. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 985–994, 2016.
Learning binary codes for high-dimensional data using bilinear projections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 484–491, 2013.
Proceedings of The 31st International Conference on Machine Learning, pages 1242–1250, 2014.
Proceedings of the sixth annual conference on Computational learning theory, pages 5–13. ACM, 1993.
Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613. ACM, 1998.
Semi-supervised hashing for scalable image retrieval. In Computer Vision and Pattern Recognition (CVPR), 2010.
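(Chain Rule I) The distributional derivative of $v = u \circ f$ for any $f(x) \in C^1: \Omega' \to \Omega$ is given by $Dv = Du\, \frac{\partial f}{\partial x}$.
(Chain Rule II) The distributional derivative of $v = f \circ u$ for any $f(x) \in C^1(\mathbb{R})$ with $f'$ bounded is given by $Dv = f'(u)\, Du$.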
Proof of Lemma 3.1. Without loss of generality, we first consider -dimension case. Given , , . For , we have
where the last equation comes from . We obtain
We generalize the conclusion to -dimension case with expectation over , , , we have the partial distributional derivative for -th coordinate as
Therefore, by Chain Rule I, we have the distributional derivative w.r.t. as
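The identity behind Lemma 3.1 can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes independent stochastic neurons with h_k ~ Bernoulli(z_k) and z = sigmoid(W^T x), and it uses a hypothetical squared reconstruction loss with a decoder matrix `U`. All function names are made up for this example. Under these assumptions the derivative of the expected loss with respect to z_k is the expected finite difference loss(h_k = 1) - loss(h_k = 0), and the ordinary chain rule through the sigmoid then gives the derivative with respect to W.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(h, x, U):
    # Hypothetical loss: squared error when decoding the code h with U.
    return float(np.sum((x - U @ h) ** 2))

def code_prob(h, z):
    # Probability of the binary code h under independent Bernoulli(z_k) bits.
    return float(np.prod(np.where(h == 1.0, z, 1.0 - z)))

def expected_loss(z, x, U):
    # Exact expectation of the loss, enumerating all 2^l binary codes.
    total = 0.0
    for bits in itertools.product([0.0, 1.0], repeat=len(z)):
        h = np.array(bits)
        total += code_prob(h, z) * loss(h, x, U)
    return total

def delta_grad_z(z, x, U):
    # k-th entry: expectation of loss(h_k = 1) - loss(h_k = 0) over the codes.
    grad = np.zeros_like(z)
    for bits in itertools.product([0.0, 1.0], repeat=len(z)):
        h = np.array(bits)
        p = code_prob(h, z)
        for k in range(len(z)):
            h1, h0 = h.copy(), h.copy()
            h1[k], h0[k] = 1.0, 0.0
            grad[k] += p * (loss(h1, x, U) - loss(h0, x, U))
    return grad

def grad_W(W, x, U):
    # Finite difference in h, then the ordinary chain rule through sigmoid.
    z = sigmoid(W.T @ x)
    return np.outer(x, delta_grad_z(z, x, U) * z * (1.0 - z))
```

Comparing `grad_W` with a central finite difference of `expected_loss` on a small problem (for instance 4 input dimensions and 3 bits) is a quick way to confirm that the two agree. Enumeration is exponential in the number of bits, so this only serves as a check; in practice the expectation would be estimated by sampling.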
To derive the approximation of the distributional derivative, we exploit the mean value theorem and Taylor expansion. Specifically, for a continuous and differentiable loss function, there exists
Moreover, for general smooth functions, we rewrite the by Taylor expansion, ,
we have an approximator as
Plugging into the distributional derivative estimator eq:new_grad_I, we obtain a simple biased gradient estimator,
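To make the bias of such an approximation concrete, the following sketch assumes that the approximator swaps the per-bit finite difference loss(h_k = 1) - loss(h_k = 0) for the ordinary gradient of the loss with respect to h, evaluated at the sampled code. The quadratic loss, the decoder matrix `U`, and the function names are hypothetical choices for illustration. For this particular loss the gap has a closed form: the gradient differs from the finite difference by (2 h_k - 1) times the squared norm of the k-th column of `U`, and the two coincide when the gradient is evaluated at h_k = 0.5, which is the mean value theorem at work.

```python
import numpy as np

def loss(h, x, U):
    # Hypothetical quadratic loss: squared error when decoding h with U.
    return float(np.sum((x - U @ h) ** 2))

def delta_loss(h, x, U):
    # Exact finite difference per bit: loss(h_k = 1) - loss(h_k = 0).
    out = np.zeros(len(h))
    for k in range(len(h)):
        h1, h0 = h.copy(), h.copy()
        h1[k], h0[k] = 1.0, 0.0
        out[k] = loss(h1, x, U) - loss(h0, x, U)
    return out

def grad_loss(h, x, U):
    # Ordinary gradient of the same loss, treating h as a continuous vector.
    return -2.0 * U.T @ (x - U @ h)

def approximation_bias(h, x, U):
    # Gap between the gradient-based approximator and the finite difference.
    return grad_loss(h, x, U) - delta_loss(h, x, U)
```

For a non-quadratic loss the gap no longer has this simple form, but the same comparison can be run numerically to gauge how much bias the cheaper estimator introduces.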
(Ghadimi and Lan, 2013) Under the assumption that is -Lipschitz smooth and the variance of the stochastic distributional gradient eq:unbiased_full_grad is bounded by , the proposed distributional SGD outputs ,
Under the assumption that the variance of the approximate stochastic distributional gradient eq:biased_full_grad is bounded by , the proposed distributional SGD outputs such that
where denotes the optimal solution.
Denote the optimal solution as , we have
Taking expectation on both sides and denoting , we have
[Figure panels: (a) MNIST; (b) GIST-1M]
We show the reconstruction error comparison between ITQ and SGH on MNIST and GIST-1M in Figure 4. The results are similar to those on SIFT-1M. Because SGH optimizes a more expressive objective than ITQ (without orthogonality) and does not use alternating optimization, it finds a better solution with lower reconstruction error.
[Figure panels: (a) MNIST; (b) GIST-1M]
We show the training time comparison between BA and SGH on MNIST and GIST-1M in Figure 5. The results are similar to those on SIFT-1M. The proposed distributional SGD learns the model much faster.
We also use different Recall K@N measures to evaluate the performance of our algorithm and its competitors. We first evaluate the algorithms with Recall 1@N in Figure 6. This is an easier task than Recall 10@N. Under this measure, the proposed SGH still achieves state-of-the-art performance.
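For concreteness, the measure can be sketched as follows, assuming that Recall K@N denotes the fraction of each query's K true Euclidean nearest neighbors that appear among the N items retrieved by Hamming ranking of the binary codes. That reading of the metric, the function name `recall_k_at_n`, and its arguments are assumptions made for this illustration rather than the evaluation code used in the experiments.

```python
import numpy as np

def recall_k_at_n(base, queries, base_codes, query_codes, k, n):
    # Ground truth: the k nearest base points under squared Euclidean distance.
    d2 = ((queries[:, None, :] - base[None, :, :]) ** 2).sum(axis=-1)
    true_nn = np.argsort(d2, axis=1, kind="stable")[:, :k]
    # Retrieval: the n base points closest in Hamming distance on the codes.
    ham = (query_codes[:, None, :] != base_codes[None, :, :]).sum(axis=-1)
    retrieved = np.argsort(ham, axis=1, kind="stable")[:, :n]
    # Count how many true neighbors were retrieved, averaged over queries.
    hits = [len(set(t) & set(r)) for t, r in zip(true_nn, retrieved)]
    return float(np.mean(hits)) / k
```

With k = 1 only the single nearest neighbor has to be found among the N retrieved items, whereas k = 10 requires ten of them. The dense distance matrices make this sketch suitable only for small datasets; a large benchmark would need batched distance computation.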
In Figure 7, we set and plot the recall while varying the number of bits on MNIST, SIFT-1M, and GIST-1M. This shows the effect of code length on the different baselines. As with Recall 10@N, the proposed algorithm consistently achieves state-of-the-art performance under this evaluation measure.
[Figure panels: (a) L2NNS on MNIST; (b) L2NNS on SIFT-1M; (c) L2NNS on GIST-1M]
In the Maximum Inner Product Search (MIPS) problem, we evaluate similarity in terms of the inner product, which avoids the scaling issue, i.e., the lengths of the samples in the reference dataset and of the queries may vary. The proposed model can also be applied to the MIPS problem. In fact, the Gaussian reconstruction model also preserves inner product neighborhoods. Denoting the asymmetric inner product as , we claim that the Gaussian reconstruction error is a surrogate for asymmetric inner product preservation. We evaluate the difference between the inner product and the asymmetric inner product,
which means minimizing the Gaussian reconstruction error, i.e., , will also lead to asymmetric inner product preservation. We emphasize that our method is designed primarily for hashing problems. Although it can be used for the MIPS problem, it differs from product quantization and its variants, whose distances are computed from lookup tables. The proposed distributional SGD can be extended to quantization. This is beyond the scope of this paper, and we leave it as future work.
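One way to see why a small reconstruction error controls the inner product gap is the Cauchy-Schwarz inequality. The sketch below assumes that the asymmetric inner product pairs a raw query x with the reconstruction `y_hat` of a database item y from its hash code; that reading, and the helper names, are assumptions made for illustration. Under it, the gap |<x, y> - <x, y_hat>| is at most ||x|| times ||y - y_hat||.

```python
import numpy as np

def inner_product_gap(x, y, y_hat):
    # Difference between the exact inner product <x, y> and the
    # asymmetric one <x, y_hat>, where y_hat stands in for the
    # reconstruction of y from its code (assumed for this sketch).
    return abs(float(x @ y) - float(x @ y_hat))

def cauchy_schwarz_bound(x, y, y_hat):
    # Upper bound on the gap: query norm times reconstruction error norm.
    return float(np.linalg.norm(x) * np.linalg.norm(y - y_hat))
```

Since the bound scales with the reconstruction error norm, driving that error down tightens the worst-case distortion of the inner product for every query of a given length.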
To evaluate the proposed SGH on the MIPS problem, we tested the algorithm on the WORD2VEC dataset. Besides the hashing baselines, since KMH is the Hamming distance generalization of PQ, we replace KMH with product quantization (Jegou et al., 2011). We trained SGH with 71,291 samples and evaluated the performance with 10,000 queries. Similarly, we vary the length of binary codes from , to , and evaluate the performance by Recall 10@N. We computed the ground truth via retrieval with the original inner product. The results are shown in Figure 8. The proposed algorithm outperforms the competitors significantly, demonstrating that the proposed SGH is also applicable to the MIPS task.
We generalize the basic model to a translation and scale invariant extension, a semi-supervised extension, as well as coding with .
As is well known, the data may not be zero-mean, and the scale of each sample in the dataset can be totally different. To eliminate the translation and scale effects, we extend the basic model to translation and scale invariant reduced-MRFs by introducing parameter to separate the translation effect and the latent variable to model the scale effect in each sample . The potential function then becomes
where denotes element-wise product, and . Compared to eqn:reduced_mrf, we replace with so that the translation and scale effects in both dimension and sample are modeled explicitly.
We treat the as parameters and as latent variable. Assuming independence in the posterior for computational efficiency, we approximate the posterior with , where denotes the parameters in the posterior approximation. With a similar derivation, we obtain the learning objective as
Obviously, the proposed distributional SGD is still applicable to this optimization.
Although we focus only on learning the hash function in the unsupervised setting, the proposed model can be easily extended to exploit supervision information by introducing a pairwise model, e.g., (Zhang et al., 2014a; Zhu et al., 2016). Specifically, we are provided (partial) supervision information for some pairs of data, i.e., , where
and stands for the set of nearest neighbors of . Besides the original Gaussian reconstruction model in the basic model in eqn:reduced_mrf, we introduce the pairwise model into the framework, which results in the joint distribution over as
where is an indicator that outputs when , otherwise . Plugging the extended model into the Helmholtz free energy, we have the learning objective as,