Generative Semantic Hashing Enhanced via Boltzmann Machines

06/16/2020 ∙ by Lin Zheng, et al. ∙ Sun Yat-sen University, Microsoft, University at Buffalo

Generative semantic hashing is a promising technique for large-scale information retrieval thanks to its fast retrieval speed and small memory footprint. For the tractability of training, existing generative-hashing methods mostly assume a factorized form for the posterior distribution, enforcing independence among the bits of hash codes. From the perspectives of both model representation and code space size, independence is not always the best assumption. In this paper, to introduce correlations among the bits of hash codes, we propose to employ the Boltzmann-machine distribution as the variational posterior. To address the intractability issue of training, we first develop an approximate method to reparameterize the distribution of a Boltzmann machine by augmenting it as a hierarchical concatenation of a Gaussian-like distribution and a Bernoulli distribution. Based on that, an asymptotically-exact lower bound is further derived for the evidence lower bound (ELBO). With these novel techniques, the entire model can be optimized efficiently. Extensive experimental results demonstrate that by effectively modeling correlations among different bits within a hash code, our model can achieve significant performance gains.




1 Introduction

Similarity search, also known as nearest-neighbor search, aims to find items that are similar to a query from a large dataset. It plays an important role in modern information retrieval systems and has been used in various applications, ranging from plagiarism analysis (Stein et al., 2007) to content-based multimedia retrieval (Lew et al., 2006). However, looking for nearest neighbors in the Euclidean space is often computationally prohibitive for large-scale datasets, since calculating cosine similarity with high-dimensional vectors is computationally expensive. Semantic hashing circumvents this problem by representing semantically similar documents with compact binary codes. Accordingly, similar documents can be retrieved much more efficiently by evaluating the Hamming distances of their hash codes.
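For illustration (not part of the original paper), Hamming-distance retrieval over binary codes packed into integers reduces to an XOR and a popcount, which is what makes hashing-based search cheap; all names below are our own:

```python
# Illustrative sketch: ranking documents by Hamming distance over hash codes
# packed as integers. XOR exposes the differing bits; counting them gives the
# Hamming distance.
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed hash codes."""
    return bin(a ^ b).count("1")

def retrieve(query: int, codes: dict, k: int = 2):
    """Return the ids of the k codes closest to `query` in Hamming distance."""
    return sorted(codes, key=lambda doc_id: hamming(query, codes[doc_id]))[:k]

codes = {"doc_a": 0b1010, "doc_b": 0b1011, "doc_c": 0b0101}
print(retrieve(0b1010, codes))  # ['doc_a', 'doc_b']: distances 0 and 1
```

On packed 64-bit or 128-bit codes this costs one XOR and one popcount per comparison, versus a full floating-point dot product for cosine similarity.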

To obtain similarity-preserving hash codes, extensive efforts have been made to learn hash functions that preserve the similarity information of original documents in the binary embedding space (Shen et al., 2015; Liu et al., 2016). Existing methods often require label information, which is expensive to obtain in practice. To avoid the use of labels, generative semantic hashing methods have been developed. Specifically, the variational autoencoder (VAE) was first employed for semantic hashing in (Chaidaroon and Fang, 2017), and the resulting model is termed VDSH. As a two-step process, the continuous document representations obtained from the VAE are directly converted into binary hash codes. To resolve this two-step training problem, NASH (Shen et al., 2018) replaces the continuous Gaussian prior in VDSH with a Bernoulli prior. By utilizing the straight-through (ST) technique (Bengio et al., 2013), the model can be trained in an end-to-end manner while keeping the merits of VDSH. Recently, to further improve the quality of hash codes, mixture priors are investigated in BMSH (Dong et al., 2019), while more accurate gradient estimators are studied in Doc2hash (Zhang and Zhu, 2019), both under a framework similar to NASH.

Due to the training-tractability issue, the aforementioned generative hashing methods all assume a factorized variational form for the posterior, e.g., independent Gaussian in VDSH and independent Bernoulli in NASH, BMSH and Doc2hash. This assumption prevents the models from capturing dependencies among the bits of hash codes. Although uncorrelated bits are sometimes preferred in hashing, as reported in (Zhang and Li, 2014), this may not apply to generative semantic hashing. This is because the independence assumption could severely limit a model's ability to yield meaningful representations and thereby produce high-quality hash codes. Moreover, as the code length increases (to e.g. 128 bits), the number of possible codes (or simply the code space) becomes too large for a dataset with a limited number of data points. As a result, we advocate that correlations among bits of a hash code should be modeled properly to restrict the embedding space, and thus enable a model to work effectively under a broad range of code lengths.

To introduce correlations among bits of hash codes, we propose to adopt the Boltzmann-machine (BM) distribution (Ackley et al., 1985) as the variational posterior to capture various complex correlations. One issue with this choice is the training inefficiency it introduces, relative to existing efficient training methods. To address this issue, we first prove that the BM distribution can be augmented as a hierarchical concatenation of a Gaussian-like distribution and a Bernoulli distribution. Using this result, we then show that samples from BM distributions can be reparameterized easily. To enable efficient learning, an asymptotically-exact lower bound of the standard evidence lower bound (ELBO) is further developed to deal with the notorious problem of the normalization term in Boltzmann machines. With the proposed reparameterization and the new lower bound, our model can be trained as efficiently as previous generative hashing models that preserve no bit correlations. Extensive experiments are conducted to evaluate the performance of the proposed model. It is observed that on all three public datasets considered, the proposed model achieves the best performance among all comparable models. In particular, thanks to the introduced correlations, we observe that the performance of the proposed model does not deteriorate as the code length increases. This is surprising and somewhat contrary to what has been observed in other generative hashing models.

2 Preliminaries

Generative Semantic Hashing

In the context of generative semantic hashing, each document is represented by a sequence of words x = (w_1, w_2, …, w_n), where w_i is the i-th word and is denoted by a |V|-dimensional one-hot vector; n and |V| denote the document size (number of words) and the vocabulary size, respectively. Each document is modeled by a joint probability:

p_θ(x, z) = p(z) ∏_{i=1}^{n} p_θ(w_i | z),   (1)

where z is a latent variable representing the document's hash code. With the probability p_θ(x, z) trained on a set of documents, the hash code for a document x can be derived directly from the posterior distribution p(z|x). In existing works, the likelihood function, or the decoder p_θ(x|z) = ∏_{i=1}^{n} p_θ(w_i | z), takes a softmax form with


p_θ(w_i = e_j | z) = exp(z^T E e_j + b_j) / Σ_{j'=1}^{|V|} exp(z^T E e_{j'} + b_{j'}),   (2)

where E is the matrix connecting the latent code and the one-hot representations of words; and e_j is the one-hot vector with the only '1' located at the j-th position. Documents could be modeled better by using more expressive likelihood functions, e.g., deep neural networks, but as explained in Shen et al. (2018), they are more likely to destroy the crucial distance-keeping property required for semantic hashing. Thus, the simple form of (2) is often preferred in generative hashing. As for the prior distribution p(z), it is often chosen as the standard Gaussian distribution as in VDSH Chaidaroon and Fang (2017), or the Bernoulli distribution as in NASH and BMSH Shen et al. (2018); Dong et al. (2019).


Probabilistic models can be trained by maximizing the log-likelihood log p_θ(x), with p_θ(x) = Σ_z p_θ(x, z). However, due to the intractability of calculating p_θ(x), we instead optimize its evidence lower bound (ELBO), i.e.,

L(θ, φ) = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) ‖ p(z)),   (3)

where q_φ(z|x) is the proposed variational posterior parameterized by φ. It can be shown that log p_θ(x) ≥ L(θ, φ) holds for any q_φ(z|x), and that the closer q_φ(z|x) is to the true posterior p_θ(z|x), the tighter the bound becomes. Training then reduces to maximizing the lower bound w.r.t. θ and φ. In VDSH Chaidaroon and Fang (2017), q_φ(z|x) takes the form of an independent Gaussian distribution

q_φ(z|x) = N(z; μ_φ(x), diag(σ²_φ(x))),   (4)

where μ_φ(·) and σ²_φ(·) are two vector-valued functions parameterized by multi-layer perceptrons (MLPs) with parameters φ. Later, in NASH and BMSH Shen et al. (2018); Dong et al. (2019), q_φ(z|x) is defined as an independent Bernoulli distribution, i.e.,

q_φ(z|x) = ∏_{i=1}^{l} Bernoulli(z_i; γ_{φ,i}(x)),   (5)

where γ_φ(·) is also a vector-valued function parameterized by an MLP, and l denotes the code length. The value γ_{φ,i}(x) at each dimension represents the probability of z_i being 1 at that position. The MLP used to parameterize the posterior is also referred to as the encoder network.
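For concreteness, when the posterior is a diagonal Gaussian and the prior is standard normal, the KL term of the ELBO has the well-known closed form from the VAE literature; the sketch below illustrates it (function names are ours, not from the paper):

```python
import math

# Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), the regularizer in the
# Gaussian-posterior ELBO used by VDSH-style models. Standard VAE result.
def gaussian_kl(mu, sigma):
    return sum(
        0.5 * (s * s + m * m - 1.0 - math.log(s * s))
        for m, s in zip(mu, sigma)
    )

print(gaussian_kl([0.0, 0.0], [1.0, 1.0]))  # 0.0: posterior equals the prior
```

The Bernoulli-posterior case in NASH replaces this with a discrete KL against the Bernoulli prior, computed per bit.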

One key requirement for efficient end-to-end training of generative hashing methods is the availability of a reparameterization for the variational distribution q_φ(z|x). For example, when q_φ(z|x) is a Gaussian distribution as in (4), a sample from it can be efficiently reparameterized as

z = μ_φ(x) + σ_φ(x) ⊙ ε,   (6)

with ε ∼ N(0, I). When q_φ(z|x) is a Bernoulli distribution as in (5), a sample from it can be reparameterized as

z = 1(γ_φ(x) − u > 0),   (7)

where u ∈ [0, 1]^l with elements u_i ∼ Uniform(0, 1), and 1(·) denotes the element-wise indicator function. With these reparameterization tricks, the lower bound in (3) can be estimated by a sample as

L̂(θ, φ) = log p_θ(x|z_φ) − KL(q_φ(z|x) ‖ p(z)),   (8)

where z has been denoted as z_φ to explicitly indicate its dependence on φ. To train these hashing models, the backpropagation algorithm can be employed to estimate the gradient of (8) w.r.t. θ and φ easily. However, it is worth noting that in order to use the reparameterization trick, all existing methods assume a factorized form for the proposed posterior q_φ(z|x), as shown in (4) and (5). This implies that the binary bits in hash codes are independent of each other, which is not the best setting for generative semantic hashing.
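The two reparameterization tricks can be sketched in a few lines of illustrative code (names are ours; the straight-through behavior is only noted in a comment, since no autodiff framework is used here):

```python
import random

# Sketch of the Gaussian and Bernoulli reparameterizations described above.
def gaussian_sample(mu, sigma):
    # z = mu + sigma * eps, with eps ~ N(0, I): noise is separated from the
    # parameters, so gradients can flow through mu and sigma.
    return [m + s * random.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def bernoulli_sample(gamma):
    # z = 1(gamma - u > 0), u ~ Uniform(0, 1). The straight-through trick
    # would treat this hard threshold as the identity during backpropagation.
    return [1 if g > random.random() else 0 for g in gamma]

z = bernoulli_sample([0.9, 0.1, 1.0, 0.0])
print(z)  # a binary code; the third bit is always 1, the fourth always 0
```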

3 Correlation-Enhanced Generative Semantic Hashing

In this section, we present a scalable and efficient approach to introducing correlations into the bits of hash codes, by using a Boltzmann-machine distribution as the variational posterior with approximate reparameterization.

3.1 Boltzmann Machine as the Variational Posterior

Many probability distributions defined over binary variables z ∈ {0, 1}^l are able to capture the dependencies among them. Among these, the most famous one is the Boltzmann-machine distribution (Ackley et al., 1985), which takes the following form:

q(z) = (1/Z) exp(½ z^T Λ z + h^T z),   (9)

where Λ ∈ R^{l×l} and h ∈ R^l are the distribution parameters; and Z is the normalization constant. The Boltzmann-machine distribution can be adopted to model correlations among the bits of a hash code. Specifically, by restricting the posterior to the Boltzmann form

q_φ(z|x) = (1/Z_φ) exp(½ z^T Λ_φ(x) z + h_φ(x)^T z),   (10)

and substituting it into the lower bound of (3), we can write the lower bound as:

L(θ, φ) = E_{q_φ(z|x)}[log p_θ(x|z) + log p(z) − ½ z^T Λ z − h^T z] + log Z_φ,   (11)

where Λ ≜ Λ_φ(x) and h ≜ h_φ(x); and Λ_φ(·) and h_φ(·) are functions parameterized by the encoder network with parameters φ and x as input. One problem with such modeling is that the expectation term in (11) cannot be expressed in a closed form due to the complexity of q_φ(z|x). Consequently, one cannot directly optimize the lower bound w.r.t. θ and φ.
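As a sanity check on this parameterization (our own illustrative code, not from the paper), one can enumerate all codes for a tiny code length and confirm that a positive coupling in Λ makes bits prefer to agree, exactly the kind of correlation a factorized posterior cannot express. The brute-force normalizer is only feasible for small l, which is why training requires the approximations developed below:

```python
import itertools
import math

# Unnormalized Boltzmann-machine score exp(0.5 * z^T Lam z + h^T z), with the
# normalizer Z computed by summing over all 2^l binary codes.
def bm_score(z, Lam, h):
    l = len(z)
    s = 0.5 * sum(Lam[i][j] * z[i] * z[j] for i in range(l) for j in range(l))
    s += sum(hi * zi for hi, zi in zip(h, z))
    return math.exp(s)

Lam = [[0.0, 0.5], [0.5, 0.0]]  # positive off-diagonal coupling: bits agree
h = [0.0, 0.0]
space = list(itertools.product([0, 1], repeat=2))
Z = sum(bm_score(z, Lam, h) for z in space)
probs = {z: bm_score(z, Lam, h) / Z for z in space}
print(probs[(1, 1)] > probs[(1, 0)])  # True: the coupling favors agreeing bits
```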

3.2 Reparameterization

An alternative way is to approximate the expectation term with the reparameterized form of a sample from q_φ(z|x), as was done in previous uncorrelated generative hashing models (see (6) and (7)). However, unlike those simple variational distributions, there is no existing work on how to reparameterize the complicated Boltzmann-machine distribution. To this end, we first show that the Boltzmann-machine distribution can be equivalently written as the composition of an approximately Gaussian correlated distribution and a Bernoulli distribution.

Proposition 1.

A Boltzmann-machine distribution q(z) ∝ exp(½ z^T Λ z + h^T z) with Λ positive definite can be equivalently expressed as the composition of two distributions, that is,

q(z) = ∫ q(z|ψ) q̃(ψ) dψ,   (12)

where q̃(ψ) = (1/C) N(ψ; h, Λ) ∏_{i=1}^{l} (1 + e^{ψ_i}), with ψ_i and h_i denoting the i-th elements of ψ and h, and C a normalizing constant; and q(z|ψ) = ∏_{i=1}^{l} Bernoulli(z_i; σ(ψ_i)), with σ(·) being the sigmoid function.

Proof.

See Appendix A.1 for details. ∎

Based on Proposition 1, we can see that a sample from the Boltzmann-machine posterior in (10) can be drawn hierarchically as

ψ ∼ q̃(ψ),   z ∼ ∏_{i=1}^{l} Bernoulli(z_i; σ(ψ_i)),   (13)

where σ(·) is applied to its argument element-wise. From the expression of q̃(ψ), we can see that for small values of ψ, the influence of the term ∏_i (1 + e^{ψ_i}) on the overall distribution is negligible, and thus q̃(ψ) can be well approximated by the Gaussian distribution N(ψ; h, Λ). For relatively large ψ, the term will only influence the distribution mean, roughly shifting the Gaussian distribution N(ψ; h, Λ) by an amount approximately equal to its variance. For problems of interest in this paper, the variances of the posterior distribution are often small, hence it is reasonable to approximate samples from q̃(ψ) by those from N(ψ; h, Λ).

With this approximation, we can now draw samples from the Boltzmann-machine posterior in (10) approximately by the two steps below:

ψ ∼ N(ψ; h_φ(x), Λ_φ(x)),   (14)
z ∼ ∏_{i=1}^{l} Bernoulli(z_i; σ(ψ_i)).   (15)

For the Gaussian sample ψ, note that the covariance matrix can always be factorized as

Λ_φ(x) = L_φ(x) L_φ(x)^T,   (16)

where L_φ(x) is the Cholesky decomposition matrix of Λ_φ(x). Then, similar to (6), ψ can be reparameterized as

ψ = h_φ(x) + L_φ(x) ε,   (17)

with ε ∼ N(0, I). It should be noted that in practice, we can define the function L_φ(x) in advance and then obtain Λ_φ(x) as L_φ(x) L_φ(x)^T; thus the Cholesky decomposition is not needed.

Given the Gaussian sample ψ, similar to the reparameterization of Bernoulli variables in (7), we can reparameterize the Bernoulli sample as z = 1(σ(ψ) − u > 0), where u ∈ [0, 1]^l with each element u_i ∼ Uniform(0, 1). By combining the above reparameterizations, a sample z from the Boltzmann-machine posterior can then be approximately reparameterized as

z_φ = 1(σ(h_φ(x) + L_φ(x) ε) − u > 0),   (18)

where the subscript φ is to explicitly indicate that the sample is expressed in terms of φ.
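The two-step approximate reparameterization above can be sketched in pure Python (an illustrative toy with a 2-bit code; all names are ours, and L here is any factor with L L^T equal to the covariance, which, as noted above, need not come from an explicit Cholesky decomposition):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Approximate reparameterized sample from the Boltzmann posterior: draw the
# Gaussian auxiliary variable psi = h + L @ eps, then threshold sigma(psi)
# against uniform noise to obtain correlated binary bits.
def bm_reparam_sample(h, L):
    l = len(h)
    eps = [random.gauss(0.0, 1.0) for _ in range(l)]
    psi = [h[i] + sum(L[i][j] * eps[j] for j in range(l)) for i in range(l)]
    u = [random.random() for _ in range(l)]
    return [1 if sigmoid(p) > ui else 0 for p, ui in zip(psi, u)]

z = bm_reparam_sample([2.0, -2.0], [[0.1, 0.0], [0.05, 0.1]])
print(z)  # bits are mostly (1, 0): small covariance keeps psi near h
```

Because the correlated noise enters only through psi, the bits of z are correlated even though each is thresholded independently given psi.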

With the reparameterization z_φ, the expectation term in (11) can be approximated as log p_θ(x|z_φ) + log p(z_φ) − ½ z_φ^T Λ z_φ − h^T z_φ. Consequently, the gradients of this term w.r.t. both θ and φ can be evaluated efficiently by backpropagation, with the only difficulty lying at the non-differentiable function 1(·) in (18). Many works have been devoted to estimating gradients involving discrete random variables (Bengio et al., 2013; Jang et al., 2017; Maddison et al., 2017; Tucker et al., 2017; Grathwohl et al., 2018; Yin and Zhou, 2019). Here, we adopt the simple straight-through (ST) technique Bengio et al. (2013), which has been found to perform well in many applications. By simply treating the hard threshold function as the identity function, the ST technique estimates the gradient as

∂z_φ/∂φ ≈ ∂σ(h_φ(x) + L_φ(x) ε)/∂φ.   (19)

Then, the gradient of the first term in the ELBO w.r.t. φ can be computed efficiently by backpropagation.

3.3 An Asymptotically-Exact Lower Bound

To optimize the ELBO in (11), we still need to calculate the gradient of log Z_φ, which is known to be notoriously difficult. A common way is to estimate the gradient by MCMC methods Tieleman (2008); Desjardins et al. (2010); Su et al. (2017a, b), which are computationally expensive and often of high variance. By noticing a special form of the ELBO (11), we instead develop a lower bound for the ELBO in which the term log Z_φ can be conveniently cancelled out. Specifically, we introduce another probability distribution q'(z) and lower bound the original ELBO:

J ≜ L(θ, φ) − KL(q'(z) ‖ q_φ(z|x)).   (20)

Since KL(q'(z) ‖ q_φ(z|x)) ≥ 0, we have J ≤ L(θ, φ) for all q'(z), i.e., J is a lower bound of L(θ, φ), and J equals the ELBO when q'(z) = q_φ(z|x). For the choice of q'(z), it should reduce the gap between J and L(θ, φ) as much as possible, while ensuring that the optimization remains tractable. Balancing the two sides, a mixture distribution is used:

q'(z) = (1/K) Σ_{k=1}^{K} ∏_{i=1}^{l} Bernoulli(z_i; σ(ψ_i^{(k)})),   (21)

where K denotes the number of components; ∏_i Bernoulli(z_i; σ(ψ_i^{(k)})) is the multivariate Bernoulli distribution and ψ^{(k)} is the k-th sample drawn from N(ψ; h_φ(x), Λ_φ(x)) as defined in (14). By substituting q'(z) into (20) and taking the expectation w.r.t. ψ^{(1)}, …, ψ^{(K)}, we have

J_K ≜ E_{ψ^{(1:K)}} [ L(θ, φ) − KL(q'(z) ‖ q_φ(z|x)) ],   (22)

where ψ^{(k)} ∼ N(ψ; h_φ(x), Λ_φ(x)) independently. It can be proved that the bound J_K gradually approaches the ELBO as K increases, and finally equals it as K → ∞. Specifically, we have

Proposition 2.

For any integer K ≥ 1, the lower bound J_K of the ELBO satisfies the conditions: 1) J_K ≤ J_{K+1} ≤ L(θ, φ); 2) lim_{K→∞} J_K = L(θ, φ).

Proof.

See Appendix A.2 for details. ∎

By substituting q_φ(z|x) in (11) and q'(z) in (21) into (22), the bound can be further written as

J_K = E_{ψ^{(1:K)}} [ E_{q_φ(z|x)}[log p_θ(x|z) + log p(z) − ½ z^T Λ z − h^T z] + E_{q'(z)}[½ z^T Λ z + h^T z − log q'(z)] ],   (23)

where the term log Z_φ is cancelled out since it appears in both terms but with opposite signs. For the first term in (23), as discussed at the end of Section 3.2, it can be approximated as log p_θ(x|z_φ) + log p(z_φ) − ½ z_φ^T Λ z_φ − h^T z_φ. For the second term, each sample ψ^{(k)} for k = 1, …, K can be approximately reparameterized like that in (17). Given the ψ^{(k)} for k = 1, …, K, samples from q'(z) can also be reparameterized in a similar way as that for Bernoulli distributions in (7). Thus, samples drawn from q'(z) are also reparameterizable, as detailed in Appendix A.3. By denoting this reparameterized sample as z'_φ, we can approximate the second term in (23) as ½ z'_φ^T Λ z'_φ + h^T z'_φ − log q'(z'_φ). Thus the lower bound (23) becomes

Ĵ_K = log p_θ(x|z_φ) + log p(z_φ) − ½ z_φ^T Λ z_φ − h^T z_φ + ½ z'_φ^T Λ z'_φ + h^T z'_φ − log q'(z'_φ).   (24)

With discrete gradient estimation techniques like the ST method, the gradient of Ĵ_K w.r.t. θ and φ can then be evaluated efficiently by backpropagation. Proposition 2 indicates that the exact J_K gets closer to the ELBO as K increases, so a better bound can be expected for the approximated Ĵ_K as well when K increases. In practice, a moderate value of K is found to be sufficient to deliver a good performance.
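To build intuition for why this mixture is convenient, the sketch below (our own illustrative code, not the authors') constructs the equal-weight mixture of factorized Bernoullis for a small code length and checks that it is properly normalized, which is what allows the intractable normalizer of the Boltzmann posterior to cancel in the bound:

```python
import itertools
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Equal-weight mixture of K factorized Bernoullis, one per Gaussian sample psi.
def mixture_prob(z, psis):
    comps = []
    for psi in psis:
        p = 1.0
        for zi, pi in zip(z, psi):
            q = sigmoid(pi)
            p *= q if zi == 1 else (1.0 - q)
        comps.append(p)
    return sum(comps) / len(comps)

psis = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(5)]  # K=5, l=3
total = sum(mixture_prob(z, psis) for z in itertools.product([0, 1], repeat=3))
print(round(total, 6))  # 1.0: the mixture is a valid, tractable distribution
```

Unlike the Boltzmann posterior, evaluating log of this mixture at a sample requires no sum over the 2^l code space, only K products of l Bernoulli factors.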

3.4 Low-Rank Perturbation for the Covariance Matrix

In the reparameterization of a Gaussian sample in (17), a matrix L_φ(x) ∈ R^{l×l} is required, with l denoting the length of hash codes. The elements of L_φ(x) are often designed as the outputs of neural networks parameterized by φ. Therefore, if l is large, the number of neural network outputs will be too large. To overcome this issue, a more parameter-efficient strategy called low-rank perturbation is employed, which restricts the covariance matrix to the form

Λ_φ(x) = D + U U^T,   (25)

where D is a diagonal matrix with positive entries and U ∈ R^{l×r} is a low-rank perturbation matrix with r ≪ l. Under this low-rank perturbed Λ_φ(x), the Gaussian samples can be reparameterized as

ψ = h_φ(x) + D^{1/2} ε_1 + U ε_2,   (26)

where ε_1 ∼ N(0, I_l) and ε_2 ∼ N(0, I_r). We can simply replace (17) with the above expression in any place that uses it. In this way, the number of neural network outputs can be dramatically reduced from O(l²) to O(lr).
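A rough sketch of the sampling rule and the parameter savings (shapes and names are our own illustration):

```python
import random

# Low-rank perturbed Gaussian sample: covariance D + U U^T with diagonal D
# (given here as a vector of its positive entries) and U of shape (l, r).
def lowrank_sample(mu, d, U):
    l, r = len(mu), len(U[0])
    eps1 = [random.gauss(0.0, 1.0) for _ in range(l)]   # for the diagonal part
    eps2 = [random.gauss(0.0, 1.0) for _ in range(r)]   # for the perturbation
    return [
        mu[i] + d[i] ** 0.5 * eps1[i] + sum(U[i][j] * eps2[j] for j in range(r))
        for i in range(l)
    ]

l, r = 128, 10
full_outputs = l * l          # encoder outputs for a dense covariance factor
lowrank_outputs = l + l * r   # diagonal D plus perturbation U
print(full_outputs, lowrank_outputs)  # 16384 1408
```

For 128-bit codes with rank 10, the encoder head shrinks by roughly an order of magnitude while the samples still carry correlated noise through U.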

4 Related Work

Semantic hashing (Salakhutdinov and Hinton, 2009) is a promising technique for fast approximate similarity search. Locality-Sensitive Hashing (Datar et al., 2004), one of the most popular hashing methods, projects documents into low-dimensional hash codes in a randomized manner. However, the method does not leverage any information about the data, and thus generally performs much worse than data-dependent methods. Among the data-dependent methods, one mainstream approach is supervised hashing, which learns a function that outputs similar hash codes for semantically similar documents by making effective use of label information (Shen et al., 2015; Liu et al., 2016).

Different from supervised methods, unsupervised hashing pays more attention to the intrinsic structure of data, without making use of the labels. Spectral hashing (Weiss et al., 2009), for instance, learns balanced and uncorrelated hash codes by seeking to preserve a global similarity structure of documents. Self-taught hashing (Zhang et al., 2010), on the other hand, focuses more on preserving local similarities among documents and presents a two-stage training procedure to obtain such hash codes. In contrast, to generate high-quality hash codes, iterative quantization (Gong et al., 2013) aims to minimize the quantization error, while maximizing the variance of each bit at the same time.

Among the unsupervised hashing methods, the idea of generative semantic hashing has gained much interest in recent years. Under the VAE framework, VDSH (Chaidaroon and Fang, 2017) was proposed to first learn the documents' continuous latent representations, which are then cast into binary codes. While semantic hashing is achieved nicely with generative models, the two-stage training procedure is problematic and prone to getting stuck in local optima. To address this issue, NASH (Shen et al., 2018) went one step further and presented an integrated framework that enables end-to-end training by using a discrete Bernoulli prior and the ST technique, which is able to estimate the gradient of functions with discrete variables. Since then, various directions have been explored to improve the performance of NASH. (Dong et al., 2019) proposed to employ mixture priors to improve the model's capability to distinguish documents from different categories, thereby improving the quality of hash codes. On the other hand, a more accurate gradient estimator called Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017) is explored in Doc2hash (Zhang and Zhu, 2019) to replace the ST estimator in NASH. More recently, to better model the similarities between different documents, (Hansen et al., 2019) investigated the combination of generative models and ranking schemes to generate hash codes. Different from the aforementioned generative semantic hashing methods, in this paper we focus on how to incorporate correlations into the bits of hash codes.

5 Experiments

5.1 Experimental Setup


Datasets

Following previous works, we evaluate our model on three public benchmark datasets: i) Reuters21578, which consists of 10788 documents with 90 categories; ii) 20Newsgroups, which contains 18828 newsgroup posts from 20 different topics; iii) TMC, which is a collection of 21519 documents categorized into 22 classes.

Training Details

For the convenience of comparison, we use the same network architecture as that in NASH and BMSH. Specifically, a 2-layer feed-forward neural network with 500 hidden units and the ReLU activation function is used as the inference network, which receives the TF-IDF of a document as input and outputs the mean and covariance matrix of the Gaussian random variables. During training, dropout (Srivastava et al., 2014) is used to alleviate overfitting, with the keep probability selected from {0.8, 0.9} based on the performance on the validation set. The Adam optimizer (Kingma and Ba, 2014) is used to train our model, with the learning rate set to 0.001 initially and then decayed every 10000 iterations. For all experiments on different datasets and lengths of hash codes, the rank r of the matrix U is set to 10 and the number of components K in the mixture distribution q'(z) is set to 10 consistently, although a systematic ablation study is conducted in Section 5.5 to investigate their impacts on the final performance.


Baselines

The following unsupervised semantic hashing baselines are adopted for comparison: Locality-Sensitive Hashing (LSH) (Datar et al., 2004), Stacked Restricted Boltzmann Machines (S-RBM) (Salakhutdinov and Hinton, 2009), Spectral Hashing (SpH) (Weiss et al., 2009), Self-Taught Hashing (STH) (Zhang et al., 2010), Variational Deep Semantic Hashing (VDSH) (Chaidaroon and Fang, 2017), Neural Architecture for Generative Semantic Hashing (NASH) (Shen et al., 2018), and Semantic Hashing model with a Bernoulli Mixture prior (BMSH) (Dong et al., 2019).

Evaluation Metrics

The performance of our proposed approach is measured by retrieval precision, i.e., the ratio of the number of relevant documents to that of retrieved documents. A retrieved document is deemed relevant if its label is the same as that of the query. Specifically, during the evaluation phase, we first pick out the top 100 most similar documents for each query document according to the Hamming distances of their hash codes, from which the precision is calculated. The precision averaged over all query documents is reported as the final performance.
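This evaluation protocol can be sketched as follows (a minimal toy with invented data and names; the actual pipeline operates on the full datasets):

```python
# Precision@k: rank documents by Hamming distance to the query code and
# report the fraction of the top k whose label matches the query's label.
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def precision_at_k(query_code, query_label, codes, labels, k):
    ranked = sorted(range(len(codes)), key=lambda i: hamming(query_code, codes[i]))
    return sum(labels[i] == query_label for i in ranked[:k]) / k

codes = [0b0001, 0b0011, 0b1110, 0b1100]
labels = ["sports", "sports", "politics", "politics"]
print(precision_at_k(0b0000, "sports", codes, labels, k=2))  # 1.0
```

The reported numbers correspond to this quantity with k = 100, averaged over all query documents.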

5.2 Results of Generative Semantic Hashing

The retrieval precisions on the TMC, Reuters and 20Newsgroups datasets are reported in Tables 1, 2 and 3, respectively, under different lengths of hash codes. Compared to NASH, a generative hashing method that does not consider correlations, the proposed method, which introduces correlations among bits simply by employing the Boltzmann-machine distribution as the posterior, performs significantly better on all three datasets considered. This strongly corroborates the benefits of taking correlations into account when learning hash codes. From the tables, we can also observe that the proposed model even outperforms BMSH, an enhanced variant of NASH that employs more complicated mixture distributions as the prior. Since only the simplest prior is used in the proposed model, larger performance gains can be expected if mixture priors are used as in BMSH. Notably, a recent work named RBSH (Hansen et al., 2019) improves NASH by explicitly ranking documents according to their similarities. However, since it employs a data preprocessing technique different from that of existing works, we cannot include its results for a direct comparison here. Nevertheless, we trained our model on their preprocessed datasets and found that our method still outperforms it. For details, please refer to Appendix A.4.

Moreover, when examining the retrieval performance of hash codes under different lengths, it is observed that the performance of our proposed method never deteriorates as the code length increases, while other models start to perform poorly after the code length reaches a certain level. For the most comparable methods, VDSH, NASH and BMSH, the performance at 128 bits is generally much worse than that at 64 bits. This phenomenon is illustrated more clearly in Figure 1. It may be attributed to the fact that for hash codes without correlations, the number of possible codes increases exponentially with the code length. Because the code space becomes too large, the probability of assigning similar items to nearby binary codes may decrease significantly. For the proposed model, in contrast, since the bits of hash codes are correlated with each other, the effective number of codes is determined by the strength of correlations among bits, which effectively restricts the size of the code space. Therefore, even as the code length continues to increase, the performance of our proposed model does not deteriorate.

Method 8 bits 16 bits 32 bits 64 bits 128 bits
LSH 0.4388 0.4393 0.4514 0.4553 0.4773
S-RBM 0.4846 0.5108 0.5166 0.5190 0.5137
SpH 0.5807 0.6055 0.6281 0.6143 0.5891
STH 0.3723 0.3947 0.4105 0.4181 0.4123
VDSH 0.4330 0.6853 0.7108 0.4410 0.5847
NASH 0.5849 0.6573 0.6921 0.6548 0.5998
BMSH n.a. 0.7062 0.7481 0.7519 0.7450
Ours 0.6959 0.7243 0.7534 0.7606 0.7632
Table 1: Precision of the top 100 retrieved documents on TMC dataset.
Method 8 bits 16 bits 32 bits 64 bits 128 bits
LSH 0.2802 0.3215 0.3862 0.4667 0.5194
S-RBM 0.5113 0.5740 0.6154 0.6177 0.6452
SpH 0.6080 0.6340 0.6513 0.6290 0.6045
STH 0.6616 0.7351 0.7554 0.7350 0.6986
VDSH 0.6859 0.7165 0.7753 0.7456 0.7318
NASH 0.7113 0.7624 0.7993 0.7812 0.7559
BMSH n.a. 0.7954 0.8286 0.8226 0.7941
Ours 0.7589 0.8212 0.8420 0.8465 0.8482
Table 2: Precision of the top 100 retrieved documents on Reuters dataset.
Method 8 bits 16 bits 32 bits 64 bits 128 bits
LSH 0.0578 0.0597 0.0666 0.0770 0.0949
S-RBM 0.0594 0.0604 0.0533 0.0623 0.0642
SpH 0.2545 0.3200 0.3709 0.3196 0.2716
STH 0.3664 0.5237 0.5860 0.5806 0.5443
VDSH 0.3643 0.3904 0.4327 0.1731 0.0522
NASH 0.3786 0.5108 0.5671 0.5071 0.4664
BMSH n.a. 0.5812 0.6100 0.6008 0.5802
Ours 0.4389 0.5839 0.6183 0.6279 0.6359
Table 3: Precision of the top 100 retrieved documents on 20Newsgroups dataset.
Figure 1: Retrieval precisions of unsupervised hashing methods on three datasets under different code lengths.

5.3 Empirical Study of Computational Efficiency

To show the computational efficiency of our proposed method, we report in Table 4 the average running time per epoch on GPU on the TMC dataset, the largest among those considered. As a benchmark, the average training time of vanilla NASH is s per epoch. It can be seen that because of the use of the low-rank parameterization of the covariance matrix, the proposed model can be trained almost as efficiently as vanilla NASH, while delivering much better performance.

Value of r Value of K Avg. Time (seconds)
1 1 2.934
1 5 3.124
5 1 3.137
5 5 3.353
10 5 3.403
10 10 3.768
Table 4: Average running time per epoch on the TMC dataset under different values of r and K.

5.4 Hash Codes Visualization

(a) VDSH
(b) NASH
(c) Ours
Figure 2: Visualization of the 128-bit hash codes learned by VDSH, NASH and our model on 20Newsgroups dataset respectively. Each data point in the figure above denotes a hash code of the corresponding document, and each color represents one category.

To further investigate the capability of different models in generating semantics-preserving binary codes, we project the 128-bit hash codes produced by VDSH, NASH and our proposed model on the 20Newsgroups dataset onto a two-dimensional plane using the widely adopted UMAP technique (McInnes et al., 2018), as shown in Figure 2. It can be seen that the hash codes produced by VDSH are quite mixed for documents from different categories, while those produced by NASH are more distinguishable, consistent with the hypothesis that NASH produces better codes than VDSH thanks to its end-to-end training. From the figure, we can further observe that the hash codes produced by our proposed method are the most distinguishable among the three methods considered, corroborating the benefits of introducing correlations among the bits of hash codes.

5.5 Analyses on the Impacts of r and K


The rank r of the perturbation matrix

The low-rank perturbed covariance matrix enables the proposed model to trade off between complexity and performance. That is, a larger r allows the model to capture more dependencies among latent variables, but the required computational complexity also increases. To investigate its impact, we evaluate the performance of the 64-bit hash codes obtained from the proposed model under different values of r, with the other key parameter K fixed to 10. The result is listed in the left half of Table 5. Notably, the proposed model with r = 0 is equivalent to NASH, since there is no correlation between the binary random variables. It can be seen that as the rank increases, the retrieval precision also increases, justifying the hypothesis that employing posteriors with correlations increases the model's representational capacity and thereby improves the quality of hash codes in turn. It is worth noting that the most significant performance improvement is observed between the models with r = 0 and r = 1; as the value of r continues to increase, the improvement becomes relatively small. This indicates that it is feasible to set r to a relatively small value to save computational resources while retaining competitive performance.

The number of mixture components K

As stated in Section 3.3, increasing the number of components K in the mixture distribution q'(z) reduces the gap between the lower bound J_K and the ELBO L(θ, φ). To investigate the impact of K, the retrieval precisions of the proposed model are evaluated under different values of K, while setting the other key parameter r = 10. It can be seen from the right half of Table 5 that as the number of components increases, the retrieval precision also increases gradually, suggesting that a tighter lower bound generally indicates better hash codes. Hence, if more mixture components are used, better hash codes can be expected. To limit the computational cost, at most 10 components are used in the experiments.

Value of r Precision Value of K Precision
0 0.7812 1 0.8300
1 0.8353 3 0.8391
5 0.8406 5 0.8395
10 0.8465 10 0.8465
Table 5: Left: retrieval precisions under different values of r with K fixed to 10 on the Reuters dataset; Right: retrieval precisions under different values of K with r fixed to 10 on the Reuters dataset.

6 Conclusion

In this paper, by employing the Boltzmann-machine distribution as the variational posterior, we show that correlations can be efficiently introduced among the bits of hash codes. To facilitate training, we first show that the BM distribution can be augmented as a hierarchical concatenation of a Gaussian-like distribution and a Bernoulli distribution. Then, an asymptotically-exact lower bound of the ELBO is developed to tackle the intractable normalization term of Boltzmann machines. Significant performance gains are observed in the experiments after introducing correlations into the bits of hash codes.


Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) (No. 61806223, U1711262, U1501252, U1611264, U1711261), National Key R&D Program of China (No. 2018YFB1004404), and Fundamental Research Funds for the Central Universities (No. 191gjc04). Also, CC appreciates the support from Yahoo! Research.


Appendix A Appendices

A.1 Proof of Proposition 1


Making use of the completing-the-square technique, the joint distribution of ψ and z can be decomposed as:

q(z, ψ) = q(z|ψ) q̃(ψ) = (1/C) N(ψ; h, Λ) e^{ψ^T z} = (1/C) N(ψ; h + Λz, Λ) exp(h^T z + ½ z^T Λ z).

From the above, the marginal distribution q(z) = ∫ q(z, ψ) dψ ∝ exp(½ z^T Λ z + h^T z), which is exactly the Boltzmann-machine distribution. ∎

A.2 Proof of Proposition 2

We show the following facts about the proposed lower bound J_K of the ELBO L(θ, φ).

First, for any integer K ≥ 1, we have J_K ≤ J_{K+1}. For brevity, we denote the mixture built from the samples ψ^{(1)}, …, ψ^{(K)} as q'_K(z). Due to the symmetry of the indices, the following equality holds:

From this, we have

Applying the equality (27) gives us:

We now show that lim_{K→∞} J_K = L(θ, φ). According to the strong law of large numbers, q'_K(z) = (1/K) Σ_{k=1}^{K} q(z|ψ^{(k)}) converges to E_ψ[q(z|ψ)] almost surely as K → ∞. We then have

Therefore, J_K approaches L(θ, φ) as K approaches infinity. ∎

A.3 Derivation of the Reparameterization for q'(z)

Recall that q'(z) = (1/K) Σ_{k=1}^{K} ∏_i Bernoulli(z_i; σ(ψ_i^{(k)})). We show that it can be easily reparameterized. Specifically, we can sample from such a mixture distribution through a two-stage procedure: (i) choosing a component k from a uniform discrete distribution over {1, …, K}, which is then transformed into a K-dimensional one-hot vector v; (ii) drawing a sample from the selected component, i.e., z ∼ ∏_i Bernoulli(z_i; σ(ψ_i^{(k)})). Moreover, we define a matrix Ψ = [ψ^{(1)}, …, ψ^{(K)}] with its columns consisting of the samples ψ^{(k)}, each of which can also be reparameterized. In this way, a sample from the distribution q'(z) can be simply expressed as

z' = 1(σ(Ψ v) − u > 0),

which can be seen as selecting a sample ψ^{(k)} and then passing it through a perturbed sigmoid function. Therefore, during training, the gradients of z' are simply back-propagated through the chosen sample ψ^{(k)}.
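The two-stage procedure can be sketched in illustrative pure Python (names are ours); the one-hot selector v picks a column of the matrix of samples through a plain matrix-vector product:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Two-stage mixture sampling: (i) pick a component with a uniform one-hot
# vector v; (ii) select its psi column via the product Psi @ v, then threshold
# the sigmoid against uniform noise.
def sample_from_mixture(Psi):  # Psi: l x K matrix, columns are psi^(k)
    l, K = len(Psi), len(Psi[0])
    k = random.randrange(K)
    v = [1 if j == k else 0 for j in range(K)]                      # one-hot
    psi = [sum(Psi[i][j] * v[j] for j in range(K)) for i in range(l)]
    u = [random.random() for _ in range(l)]
    return [1 if sigmoid(p) > ui else 0 for p, ui in zip(psi, u)]

Psi = [[100.0, 100.0], [-100.0, -100.0]]  # both components agree here
print(sample_from_mixture(Psi))  # first bit always 1, second essentially always 0
```

In a differentiable implementation, only the thresholding step needs a discrete gradient estimator; gradients flow through the selected column Ψv as in ordinary reparameterization.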

A.4 Comparisons between RBSH and Our Method

As discussed before, the main reason that we cited (Hansen et al., 2019) but did not compare with it is that their datasets are preprocessed differently from ours. Therefore, it is inappropriate to directly include the performance of their model in the comparisons of our paper. Our work is a direct extension along the research line of VDSH and NASH; in our experiments, we followed their setups and used the preprocessed datasets made public by them. In (Hansen et al., 2019), however, the datasets are preprocessed by the authors themselves. The preprocessing procedure greatly influences the final performance, as observed in the reported results.

To see how our model performs compared to (Hansen et al., 2019), we evaluate it on the 20Newsgroup and TMC datasets preprocessed by the method in (Hansen et al., 2019). The results are reported in Table 6, where RBSH denotes the model from (Hansen et al., 2019). We can see that on the same preprocessed datasets, our model overall performs better than RBSH, especially for long codes. It should be emphasized that the correlation-introducing method proposed in this paper can be used with all existing VAE-based hashing models. In this paper, the base model is NASH, and when the two are used together, we see significant performance improvements. Since RBSH is also a VAE-based hashing model, the proposed method could also be used with it to introduce correlations into the code bits, and significant improvements can be expected as well.

Number of Bits 20Newsgroup (RBSH / Ours) TMC (RBSH / Ours)
8 0.5190 / 0.5393 0.7620 / 0.7667
16 0.6087 / 0.6275 0.7959 / 0.7975
32 0.6385 / 0.6647 0.8138 / 0.8203
64 0.6655 / 0.6941 0.8224 / 0.8289
128 0.6668 / 0.7005 0.8193 / 0.8324
Table 6: Precision of the top 100 retrieved documents on the 20Newsgroup and TMC datasets.