Nearest neighbor (NN) search, a fundamental requirement in image retrieval, has attracted increasing interest due to the ever-growing volume of large-scale data on the web. Recently, similarity-preserving hashing methods that encode images into binary codes have been widely studied. Learning good hash functions requires two things: 1) a powerful image representation and 2) efficient search in the representation space. In this paper, we focus on deep-network-based hashing that enables efficient search while maintaining good retrieval performance.
In recent years, we have witnessed the great success of deep neural networks, which mainly stems from the powerful representations learned by deep network architectures. Deep-network-based hashing methods learn image representations together with binary hash codes. Lin et al. proposed a method that learns hash codes and image representations in a point-wise manner. Li et al. proposed a deep hashing method called deep pairwise-supervised hashing (DPSH), which performs hash code learning and feature learning simultaneously. Zhao et al. presented a deep semantic ranking-based method for learning hash functions that preserve multilevel semantic similarity for multi-label images. Further, Zhuang et al. proposed a fast deep network for triplet-supervised hashing.
Although powerful binary codes can be learned by deep networks, a linear scan over Hamming distances is still time-consuming on large-scale datasets (e.g., millions or billions of images). Many methods have been proposed for efficient searching in Hamming space. One popular approach is to use binary codes as indices into a hash table. The problem is that the number of buckets to examine grows near-exponentially with the search radius. Norouzi et al. proposed the multi-index hashing (MIH) method for fast searching, which divides the binary codes into smaller substrings and builds multiple hash tables. MIH assumes that binary codes are uniformly distributed over the Hamming space, which often does not hold in practice. Liu et al. and Wan et al. each proposed a data-oriented multi-index hashing method: they first compute the correlation matrix between bits and then rearrange the bit indices to obtain a more uniform distribution in each hash table. Ong et al. relaxed the equal-size constraint in MIH and proposed multiple hash tables with variable-length hash keys. Wang et al. used repeat-bits in Hamming space to accelerate searching, at the cost of more storage space. Song et al. proposed a distance-computation-free search scheme for hashing.
Most existing works first use a hashing model (e.g., LSH, MLH) to encode images into binary codes and then apply a separate method to rebalance the distribution of the binary codes. Such fixed hashing models may result in suboptimal searching. Ideally, the hash model and the balancing procedure should be learned simultaneously during the hash learning process.
In this paper, we propose a deep architecture for fast searching and efficient image representation that incorporates the MIH approach into the network. As shown in Figure 1, our architecture consists of three main building blocks. The first block learns a good image representation through stacked convolutional and fully-connected layers, followed by a slice layer that divides the intermediate image features into multiple substrings, each corresponding to one hash table as in the MIH approach. The second and third blocks learn a uniform code distribution by balancing the binary codes at the feature level and the instance level, respectively. At the feature level, we make the bits as uniformly distributed as possible in each substring hash table by adding a new balanced constraint to the objective. At the instance level, we penalize buckets that contain too many items, since checking the many candidate codes in such buckets costs much time. Finally, a similarity-preserving objective with the two balanced constraints is proposed to capture the similarities among the images, and a fast hash model is learned to encode all images into more uniformly distributed binary codes.
The main contributions of this work are two-fold.
We propose deep multi-index hashing, which learns hash functions for both powerful image representation and fast searching.
We conduct extensive evaluations on several benchmark datasets. The empirical results demonstrate the superiority of the proposed method over the state-of-the-art baseline methods.
2 Related Work
Learning-to-hash methods learn hash functions from the training data in order to generate better binary representations. Representative methods include Iterative Quantization (ITQ), Kernelized LSH (KLSH), Anchor Graph Hashing (AGH), Spectral Hashing (SH), Semi-Supervised Hashing (SSH), Kernel-based Supervised Hashing (KSH), Minimal Loss Hashing (MLH), Binary Reconstructive Embedding (BRE) and so on. A comprehensive survey can be found in .
Several kinds of deep-network-based hashing methods have been proposed, including the point-wise approach, the pair-wise approach and the ranking-based approach. The point-wise methods take a single image as input, and the loss function is built on individual data. For example, Lin et al. showed that binary codes can be learned by employing a hidden layer that represents the latent concepts dominating the class labels, and thus proposed to learn the hash codes and image representations in a point-wise manner. Yang et al. proposed a loss function defined on the classification error and other desirable properties of hash codes to learn the hash functions. The pair-wise methods take image pairs as input, and the loss functions characterize the relationship (i.e., similar or dissimilar) between the two images of a pair. Specifically, if two images are similar, the Hamming distance between their codes should be small; otherwise, the distance should be large. Representative methods include deep pairwise-supervised hashing (DPSH), deep supervised hashing (DSH) and so on. The ranking-based methods cast learning-to-hash as a ranking problem. Zhao et al. proposed a deep semantic ranking-based method for learning hash functions that preserve multi-level semantic similarity between multi-label images. Zhuang et al. proposed a fast deep network for triplet-supervised hashing.
Although deep learning-to-hash methods obtain powerful image representations, existing works usually do not consider fast searching in the learned code space. Multi-index hashing [4, 22] is an efficient method for finding all $r$-neighbors of a query by dividing the binary codes into multiple substrings. However, the binary codes learned by a deep network are often not uniformly distributed in practice; e.g., all images with the same label may be indexed with a similar key, as shown in Figure 2, so that checking the many candidate codes costs much time. In this paper, we address this problem by adding two balanced constraints to our network and thereby learn more uniformly distributed binary codes.
3 Background: Multi-Index Hashing
In this section, we briefly review MIH.
MIH is a method for fast searching in large-scale datasets, in which each binary code $h$ is partitioned into $m$ disjoint substrings, $h^{(1)}, \dots, h^{(m)}$. Each substring consists of $q/m$ bits, where $q$ is the code length and we assume that $q$ is divisible by $m$ for convenience of presentation. One hash table is built for each of the $m$ substrings.
The $r$-neighbors of a query $g$, denoted $N_r(g)$, are all codes in the database that differ from $g$ in $r$ bits or fewer. To search for the $r$-neighbors of a query $g$ with substrings $g^{(1)}, \dots, g^{(m)}$, MIH searches all $m$ substring hash tables for entries that are within a Hamming distance of $r/m$ of the corresponding query substring. (For ease of presentation, we assume here that $r$ is divisible by $m$. In practice, if $r = m r' + a$ with $0 \le a < m$, we can set the search radii of the first $a + 1$ hash tables to $r'$ and those of the rest to $r' - 1$.) The set of candidates from the $k$-th substring hash table is denoted $N^{(k)}(g)$. The union of all these sets, $N(g) = \bigcup_{k} N^{(k)}(g)$, is then a superset of the $r$-neighbors of $g$. The false positives that are not true $r$-neighbors of $g$ are removed by computing the full Hamming distance.
The NN search problem can be formulated as an $r$-near neighbor problem: starting from the integer $r = 0$, we progressively increment the search radius $r$ until the specified number of neighbors is found.
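The search procedure reviewed above can be illustrated with a minimal Python sketch. It is not the implementation of the MIH authors: codes are stored as bit strings for readability, every table is searched at radius floor(r/m) (which still yields a superset of the r-neighbors by the pigeonhole argument), and the function names `build_tables`, `mih_search` and `mih_knn` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def split_code(code, m):
    """Split a bit string into m equal-length substrings."""
    s = len(code) // m
    return [code[k * s:(k + 1) * s] for k in range(m)]


def hamming(a, b):
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


def neighbors_within(sub, radius):
    """Yield every bit string within the given Hamming radius of sub."""
    for d in range(radius + 1):
        for positions in combinations(range(len(sub)), d):
            flipped = list(sub)
            for p in positions:
                flipped[p] = "1" if flipped[p] == "0" else "0"
            yield "".join(flipped)


def build_tables(codes, m):
    """Build one hash table per substring position (key -> list of code indices)."""
    tables = [defaultdict(list) for _ in range(m)]
    for idx, code in enumerate(codes):
        for k, sub in enumerate(split_code(code, m)):
            tables[k][sub].append(idx)
    return tables


def mih_search(query, codes, tables, r):
    """Return the indices of all codes within Hamming distance r of query."""
    m = len(tables)
    sub_radius = r // m  # assumption: the same radius floor(r/m) for every table
    candidates = set()
    for k, sub in enumerate(split_code(query, m)):
        for key in neighbors_within(sub, sub_radius):
            candidates.update(tables[k].get(key, ()))
    # Remove false positives by computing the full Hamming distance.
    return sorted(i for i in candidates if hamming(query, codes[i]) <= r)


def mih_knn(query, codes, tables, k):
    """k-NN search by progressively incrementing the search radius."""
    r = 0
    while True:
        found = mih_search(query, codes, tables, r)
        if len(found) >= k or r >= len(query):
            return sorted(found, key=lambda i: hamming(query, codes[i]))[:k]
        r += 1
```

For example, with `codes = ["00000000", "00000001", "11110000", "11111111"]` and `tables = build_tables(codes, 2)`, calling `mih_search("00000000", codes, tables, 1)` inspects one bucket per table and keeps only the codes whose full distance is at most 1.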
4 Deep Multi-Index Hashing
This section describes the deep multi-index hashing architecture, which allows us to 1) obtain powerful binary codes and 2) search efficiently inside the binary code space.
We first introduce the notation. We are given a labeled training set $\{(I_i, y_i)\}_{i=1}^{n}$, where $I_i$ is the $i$-th image, $y_i$ is the class label of the $i$-th image, and $n$ is the number of training samples. Suppose that each binary code comprises $q$ bits. The goal of deep multi-index hashing is to learn a deep hash model that preserves the similarities among the binary codes and also supports fast searching in a large-scale binary code space.
As shown in Figure 1, the proposed architecture has two parts: 1) a deep network with multiple convolution-pooling layers that captures an efficient representation of images, followed by a slice layer that partitions the features into disjoint substrings; and 2) a balanced codes module designed to support fast searching inside the binary code space. This module makes the binary codes as uniformly distributed as possible in each substring hash table from two aspects: the feature level and the instance level. In the following, we present the details of these parts.
4.1 Efficient Representation via Deep Network
A deep network, e.g., AlexNet, VGG, GoogLeNet or a residual network, is used to learn a powerful image representation, with the following structural modifications for the image retrieval task. The first modification is to remove the last fully-connected layer (e.g., fc8). The second is to add a fully-connected layer that outputs $q$-dimensional intermediate features, where $q$ is the code length. The intermediate features are then fed into a tanh layer that restricts the values to the range $[-1, 1]$. MIH uses $m$ separate hash tables. Inspired by this, the third modification is to add a slice layer that divides the features into $m$ slices of equal length $q/m$. Following the suggestion of MIH, the number of substring hash tables is set to , which shows the best empirical performance in . Finally, the output of the network is denoted as , where is the input image and is the deep network.
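The tanh and slice steps described above can be sketched in a few lines of plain Python. This is only an illustration of the data flow after the added fully-connected layer, not the Caffe network itself; the function names are hypothetical, and thresholding at zero to obtain bits is an assumption that this section does not state.

```python
import math


def tanh_and_slice(features, m):
    """Apply tanh to the intermediate features and split them into m equal slices."""
    q = len(features)
    if q % m != 0:
        raise ValueError("feature length must be divisible by the number of slices")
    activated = [math.tanh(v) for v in features]  # values now lie in (-1, 1)
    s = q // m
    return [activated[k * s:(k + 1) * s] for k in range(m)]


def binarize(slices):
    """Turn each slice into a bit-string key (assumption: threshold at zero)."""
    return ["".join("1" if v > 0 else "0" for v in sl) for sl in slices]
```

Each returned bit string would serve as the key into the hash table of the corresponding substring.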
Deep-network-based methods can learn a very powerful image representation for image retrieval, but they do not consider the ability to search efficiently inside the representation space. An example in which the binary codes give a powerful image representation but poor search efficiency is shown in Figure 2. Here the substring is of length . Suppose that there are 2 class labels and that each class consists of 50,000 images. Without loss of generality, the first 50,000 images have label and the remaining 50,000 images have label . The hash table is built for the 100,000 learned binary codes shown in Figure 2, where similar codes fall into the same bucket (with a similar key) and dissimilar codes have the largest Hamming distance, i.e., , in the hash table. The learned binary codes are very good for accuracy but very bad for searching: given a query, a very large number of candidate items (e.g., 50,000 items) must be checked. It is therefore necessary to find a new way to generate more balanced binary codes.
4.2 Fast Searching via the Deep Multi-Index Hashing
We first give the following proposition.
Proposition 1. When the buckets in the substring hash tables that differ from within bits, i.e., , then we have , where .
For example, suppose that . When searching in the first substring hash table, we obtain a set of candidates ; the -neighbors (i.e., ) of query are then a subset of these candidates, that is, . Similarly, we have , etc. When searching the substring hash tables for entries that differ by bits or fewer, we can obtain all -neighbors of the query, where .
According to the above proposition, the running time of MIH for NN search mainly consists of two parts: index lookups and candidate code checking. To achieve faster searching, we should reduce 1) the number of distinct hash buckets to examine, i.e., the smaller , the better; and 2) the number of candidate codes to check, i.e., the smaller , the better.
4.2.1 Balanced binary codes in Instance-level
To reduce the running time for index lookups, the binary codes of similar images should be indexed with similar keys, as shown in Figure 2. In such a case, . Unfortunately, we then need to check a very large number of candidate binary codes, which makes searching inefficient. Thus, the number of items in each bucket should be neither too small nor too large. Balanced binary codes at the instance level are learned to address this problem; they require that each bucket in the hash table contains at most items.
Formally, we first find all buckets in all substring hash tables that contain more than items. Let denote the items in the -th bucket of the -th substring hash table. We use the following steps to rebalance these items, as shown in Figure 3.
1) The full Hamming distance is used to split these items into several groups, where each group contains the samples that have the same binary code. If the number of items in every group is less than , the procedure stops.
2) Otherwise, if the number of items in a group is more than , we further randomly split it into subgroups of equal size, making sure that each subgroup consists of at most items, where is the number of items in the group.
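The two rebalancing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rebalance_bucket`, the `(item_id, full_code)` input format and the fixed random seed are hypothetical, a group is split only when it holds more than `max_items` items, and the number of subgroups is taken to be the smallest count that keeps every subgroup within the limit.

```python
import math
import random
from collections import defaultdict


def rebalance_bucket(items, max_items, seed=0):
    """Group the items of one oversized bucket and split the large groups.

    items: list of (item_id, full_code) pairs that share one substring key.
    Returns a list of groups; each group is a list of subgroups of item ids.
    """
    # Step 1: items with identical full codes (full Hamming distance 0) form a group.
    groups = defaultdict(list)
    for item_id, code in items:
        groups[code].append(item_id)

    rng = random.Random(seed)
    result = []
    for code in sorted(groups):
        members = list(groups[code])
        if len(members) <= max_items:
            result.append([members])  # small enough: a single subgroup
            continue
        # Step 2: randomly split a large group into near-equal subgroups.
        rng.shuffle(members)
        n_sub = math.ceil(len(members) / max_items)
        result.append([members[j::n_sub] for j in range(n_sub)])
    return result
```

The resulting group and subgroup memberships are what the objective would use to decide which pairs should stay at distance 0 and which should be pushed slightly apart.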
A key principle is that the rebalancing must not change the similarities among these images; that is, the distance between and , i.e., , should preserve relative similarities of the form “( in the same subgroup) < ( in the same group) < ( in different groups)”. Thus, the objective can be formulated as:
where, to rebalance the items, we let the Hamming distances of the -th substring between the examples in change from 0 to 0, 1, or 2.
4.2.2 Balanced binary codes in Feature-level
To reduce the running time for candidate code checking, the number of false positives in the candidate set should be small, that is, we minimize . To achieve this, should not contain too many items that are not true -neighbors of the query. That is, when the substrings and differ by bits, the full codes and should differ by bits or fewer. This leads to the following proposition:
Proposition 2. Suppose that for all and , we have , and , where , then .
According to Proposition 2, we add the following new balanced constraint to our objective:
where . The above formulation requires the distances in all substrings to be almost equal: the distance of each substring should be less than or equal to and greater than or equal to .
Overall, the similarity-preserving loss function for balanced codes can be formulated as:
where are parameters and is the distance between two binary codes. For ease of optimization, we replace the Hamming distance with the Euclidean distance. In all our experiments, we set , , and . The first term of the objective preserves relative similarities of the form “ is more similar to than to ”. The second term generates balanced codes at the feature level, and the third term generates balanced codes at the instance level.
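To make the role of the first term concrete, the sketch below shows a generic triplet ranking loss that encourages a query code to be closer to a similar code than to a dissimilar one. It is an assumption-laden illustration rather than the objective of this paper: the hinge form, the squared Euclidean distance, the default `margin` of 1.0 and the function names are all hypothetical, and the two balanced terms are not sketched.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two real-valued code vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_ranking_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that is zero once the negative is farther than the positive by the margin.

    The margin value and the squared distance are illustrative assumptions.
    """
    d_pos = squared_euclidean(anchor, positive)
    d_neg = squared_euclidean(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)
```

With relaxed codes in $[-1, 1]$, a triplet that is already ranked correctly by a wide gap contributes no loss, while a triplet ranked the wrong way is penalized in proportion to the violation.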
In this section, we evaluate and compare the performance of the proposed method with several state-of-the-art algorithms.
5.1 Datasets and Experimental Setting
On NUS-WIDE, we follow the settings in [33, 18] for a fair comparison. The 21 most frequent labels are selected, where each label is associated with at least 5,000 images. We randomly select 100 images from each of the 21 selected classes to form a query set of 2,100 images. The remaining images are used as the retrieval database. From the retrieval database, 500 images from each of the 21 selected classes are randomly chosen as the training set.
On SVHN, we randomly select 1,000 images (100 images per class) as the query set, and 5,000 images (500 images per class) from the remaining images as the training set.
We implement the proposed method using the open-source Caffe framework. In this paper, we use AlexNet as our basic network. The weights of the layers are first initialized from the pre-trained AlexNet model (http://dl.caffe.berkeleyvision.org/bvlc_alexnet.caffemodel).
In this subsection, we evaluate the query time of our method by comparing it with the existing deep-network-based method. To make a fair comparison, we compare two methods:
DeepHash. The hash functions are learned without the assistance of the balanced constraints, i.e., using only the first term of the objective (2).
Deep Multi-Index Hashing (DMIH). The hash functions are learned with the assistance of the balanced constraints, i.e., using all terms in the objective (2).
Since the two methods use the same network and differ only in whether the proposed feature-level and instance-level balanced constraints are used, these comparisons show whether the balanced constraints contribute to the search speed.
After obtaining the binary codes, we use the implementation of MIH provided by its authors (https://github.com/norouzi/mih) to report the speed-up ratios of the two methods over linear scan on all of the above databases. The speed-up factors of MIH over linear scan for both the proposed method and DeepHash on different NN problems are shown in Table 1. Note that linear scan does not depend on the underlying distribution of the binary codes, so the running times of linear scan for the two methods are the same.
The results show that DMIH is more efficient than DeepHash, especially for small NN problems. For instance, for -NN with 96-bit codes on SVHN, the speed-up factor for DMIH is 92.46, compared to 9.96 for DeepHash. On NUS-WIDE, our method shows about speed-up ratio in comparison with DeepHash.
The main reason is that the proposed method learns more balanced hash codes than DeepHash. To give an intuitive understanding of our method, we use an entropy-based measurement defined as
$$E_k = -\sum_{i=1}^{2^{s_k}} p_i \log p_i,$$
where $s_k$ is the dimension of the $k$-th hash table, so that there are $2^{s_k}$ buckets in this table, and $p_i$ is the probability that a code is assigned to bucket $i$, defined as $p_i = n_i / N$, where $n_i$ is the number of codes in bucket $i$ and $N$ is the size of the database. Note that a higher entropy value means a better distribution of data items in the hash tables.
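The entropy of the bucket occupancy in one hash table can be computed with a short Python sketch. It assumes that the measurement is the standard Shannon entropy of the empirical bucket probabilities and uses a base-2 logarithm, which is a choice made here for illustration; the function name `bucket_entropy` is hypothetical.

```python
import math
from collections import Counter


def bucket_entropy(keys):
    """Entropy (in bits) of the empirical distribution of codes over buckets.

    keys: one substring key per database item for a single hash table.
    Empty buckets have probability zero and contribute nothing to the sum.
    """
    total = len(keys)
    counts = Counter(keys)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

When every item falls into the same bucket the entropy is zero, and it reaches its maximum, the substring length in bits, when the items are spread evenly over all buckets.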
| Method | 64 bits | 96 bits | 128 bits | 256 bits |
Again, for all databases and code lengths, our method yields a higher entropy than the baseline. This also explains why our method achieves faster searching.
Further, we evaluate and compare the retrieval performance of the proposed method with several state-of-the-art algorithms. LSH, ITQ, ITQ-CCA, SH and DeepHash are selected as the baselines. The results of LSH, ITQ, ITQ-CCA and SH are obtained with the implementations provided by their respective authors. Note that DeepHash is very similar to the existing work One-Stage Hash, which also divides the features into several slices and uses the triplet ranking loss to preserve the similarities. Since the results of DeepHash and One-Stage Hash are almost the same, we only report the results of DeepHash. To evaluate the quality of hashing, we use Mean Average Precision (MAP) and precision curves w.r.t. different numbers of top returned samples as the evaluation metrics. For a fair comparison, all of the methods use identical training and test sets, and the AlexNet model is used to extract deep features (i.e., 4096-dimensional features from the fc7 layer) for LSH, ITQ, ITQ-CCA and SH.
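For readers unfamiliar with the two evaluation metrics, a minimal Python sketch is given below. It is not the evaluation code used in the experiments: the function names are hypothetical, each query is represented by a ranked list of 0/1 relevance flags, and average precision is normalized by the number of relevant items that appear in the ranked list, which is one common convention and an assumption here.

```python
def average_precision(relevance):
    """Average precision for one query, given 0/1 relevance flags in ranked order."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at each relevant position
    return total / hits if hits else 0.0


def mean_average_precision(all_relevance):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in all_relevance) / len(all_relevance)


def precision_at_k(relevance, k):
    """Fraction of relevant items among the top-k returned samples."""
    return sum(relevance[:k]) / k
```

Plotting `precision_at_k` over a range of `k` values for each method gives a precision curve of the kind described above.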
Figure 4 shows the comparison results on the two datasets. We can see that 1) the deep-network-based methods show improvements over the baselines that use fixed deep features, and 2) DMIH shows performance comparable to that of the most closely related baseline, DeepHash. These results verify that adding the new balanced constraints does not degrade the retrieval performance.
In summary, our method is 2 to 10 times faster than DeepHash with comparable retrieval performance.
5.2.1 Effects of the Feature-level and Instance-level Constraints
In this set of experiments, we show the advantages of the two proposed balanced constraints. To give an intuitive comparison, we report the results of using only the feature-level constraint and only the instance-level constraint, respectively.
Table 3 shows the comparison results. The instance-level constraint is very useful on SVHN, while the feature-level constraint is helpful on the NUS-WIDE dataset. Which constraint helps more depends on the distribution of the learned binary codes.
5.2.2 Effect of the End-to-end Learning
Our framework is an end-to-end framework. To show the advantages of the end-to-end framework, we compare to the following baseline, which adopts a two-stage strategy. In the first stage, DeepHash is learned and the images are encoded into binary codes. In the second stage, we rebalance the binary codes by the data-driven multi-index hashing .
Table 4 shows the comparison results. We can observe that our method performs better than both DeepHash and the two-stage method. It is therefore desirable to learn the hash function and the balancing procedure in an end-to-end framework.
In this paper, we proposed a deep-network-based multi-index hashing method for fast searching and good performance. In the proposed deep architecture, an image goes through a deep network with stacked convolutional layers and is encoded into a high-level image representation with several substrings. Then, we proposed to learn more balanced binary codes by adding two constraints. One is the feature-level constraint, which makes the binary codes as balanced as possible in each hash table. The other is the instance-level constraint, which makes the buckets in each substring hash table contain a balanced number of items. Finally, a deep hash model for both powerful image representation and fast searching is learned in a single framework. Empirical evaluations on two datasets show that the proposed method runs faster than the baseline and achieves comparable performance.
In future work, we plan to apply DMIH to different networks and methods to explore the effect of the proposed balanced constraints. We also plan to reduce the time needed to extract features from the deep network.
Proof of Proposition 1
Suppose that there exists one binary code , and . According to Proposition 1 in , we have for a substring. We consider two situations: 1) if , then according to the definition of , which contradicts the premise; 2) if and all the first substrings are strictly greater than , then the total number of bits that differ in the last is at most . Using Proposition 1 in again, we have , thus , which contradicts the premise.
Proof of Proposition 2
Suppose that ; then there is at least one binary code that satisfies and . Since is not a -neighbor of query , and differ by at least bits. Since , we have , thus . According to the assumption and , we have , thus by the definition of , which contradicts the premise.
- [1] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y.-T. Zheng. NUS-WIDE: A real-world web image database from National University of Singapore. In CIVR, Santorini, Greece, July 8-10, 2009.
- [2] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518–529, 1999.
- [3] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, pages 817–824, 2011.
- [4] D. Greene, M. Parnas, and F. Yao. Multi-index hashing for information retrieval. In FOCS, 1994.
- [5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- [6] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org, 2013.
- [7] Q.-Y. Jiang and W.-J. Li. Deep cross-modal hashing. In CVPR, 2017.
- [8] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
- [9] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, pages 1042–1050, 2009.
- [10] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In ICCV, pages 2130–2137, 2009.
- [11] H. Lai, Y. Pan, Y. Liu, and S. Yan. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, pages 3270–3278, 2015.
- [12] W. Li. Feature learning based deep supervised hashing with pairwise labels. In IJCAI, pages 3485–3492, 2016.
- [13] K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen. Deep learning of binary hash codes for fast image retrieval. In CVPR, pages 27–35, 2015.
- [14] H. Liu, R. Wang, S. Shan, and X. Chen. Deep supervised hashing for fast image retrieval. In CVPR, pages 2064–2072, 2016.
- [15] L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao. Deep sketch hashing: Fast free-hand sketch-based image retrieval. In CVPR, 2017.
- [16] Q. Liu, H. Xie, Y. Liu, C. Zhang, and L. Guo. Data-oriented multi-index hashing. In ICME, pages 1–6. IEEE, 2015.
- [17] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081, 2012.
- [18] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In ICML, pages 1–8, 2011.
- [19] D. Mandal, K. Chaudhury, and S. Biswas. Generalized semantic preserving hashing for n-label cross-modal retrieval. In CVPR, 2017.
- [20] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS, volume 2011, page 5, 2011.
- [21] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In ICML, pages 353–360, 2011.
- [22] M. Norouzi, A. Punjani, and D. J. Fleet. Fast exact search in hamming space with multi-index hashing. TPAMI, 36(6):1107–1119, 2014.
- [23] E.-J. Ong and M. Bober. Improved hamming distance search using variable length substrings. In CVPR, pages 2000–2008, 2016.
- [24] R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In AISTATS, pages 412–419, 2007.
- [25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [26] J. Song, H. Shen, J. Wang, Z. Huang, N. Sebe, and J. Wang. A distance-computation-free search scheme for binary code databases. IEEE Transactions on Multimedia, 18(3):484–495, 2016.
- [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
- [28] J. Wan, S. Tang, Y. Zhang, L. Huang, and J. Li. Data driven multi-index hashing. In ICIP, pages 2670–2673. IEEE, 2013.
- [29] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for scalable image retrieval. In CVPR, pages 3424–3431, 2010.
- [30] J. Wang, T. Zhang, N. Sebe, et al. A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
- [31] M. Wang, X. Feng, and J. Cui. Multi-index hashing with repeat-bits in hamming space. In FSKD, pages 1307–1313. IEEE, 2015.
- [32] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, pages 1753–1760, 2008.
- [33] R. Xia, Y. Pan, H. Lai, C. Liu, and S. Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, pages 2156–2162, 2014.
- [34] H.-F. Yang, K. Lin, and C.-S. Chen. Supervised learning of semantics-preserving hashing via deep neural networks for large-scale image search. arXiv preprint arXiv:1507.00101, 2015.
- [35] R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. TIP, 24(12):4766–4779, 2015.
- [36] Z. Zhang, Y. Chen, and V. Saligrama. Efficient training of very deep neural networks for supervised hashing. In CVPR, pages 1487–1495, 2016.
- [37] F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic ranking based hashing for multi-label image retrieval. arXiv preprint arXiv:1501.06272, 2015.
- [38] B. Zhuang, G. Lin, C. Shen, and I. Reid. Fast training of triplet-based deep binary embedding networks. In CVPR, pages 5955–5964, 2016.