With the ever-growing volume of large-scale image data on the web, image retrieval has attracted increasing interest. Hashing is a popular nearest-neighbor search method that learns similarity-preserving hash functions to encode images into binary codes.
Many algorithms have been proposed for learning similarity-preserving hash functions [3, 32, 18]. One of the leading approaches is deep-network-based hashing [17, 37, 13], which learns powerful image representations together with the binary hash codes. Lin et al.  proposed to learn the hash codes and image representations in a point-wise manner. Li et al.  and Liu et al.  presented deep pairwise-supervised hashing. Further, Zhao et al.  presented a deep ranking-based method for multi-label images, and Zhuang et al.  proposed a triplet-based deep hash network.
However, as image data continue to grow, many new semantic concepts emerge rapidly, and these brand-new classes may appear with zero or little labeled data. Thus, it is desirable to learn similarity-preserving hash functions for such novel/target classes.
Zero-shot learning (ZSL) aims to build recognition models capable of recognizing novel classes without labeled training samples. The main idea behind ZSL is to exploit knowledge transfer by embedding both sets of classes into a common semantic representation, e.g., the word vectors of the class names. Inspired by this, Yang et al.  first proposed zero-shot hashing (ZSH), which sets up a tunnel to transfer supervised knowledge between the source and target classes via an intermediate-level semantic representation. However, because the source and target classes have different data distributions, applying hash functions learned on the source dataset to the target dataset without any adaptation may cause an unknown bias, known as the projection domain shift problem [1, 24]. Pachori et al.  introduced an unsupervised domain-adaptation model for ZSH, which updates the learned hash model with query data from the target classes.
In this paper, we focus on ZSH in the transductive setting [1, 5], i.e., unlabelled images from the target classes are available. In the real world, images of the seen classes are usually more common than those of the novel ones, so it is unrealistic to assume that the seen classes never appear in the unlabelled dataset. Thus, we do not require the unlabelled data to come entirely from the target classes. Specifically, two datasets with disjoint classes are considered in TZSH: a labelled source dataset in which all samples are from seen classes, and an unlabelled dataset that includes images from both the seen and novel classes.
We propose a deep architecture for transductive zero-shot hashing via coarse-to-fine similarity mining. As shown in Fig. 1, our architecture is a two-stream neural network. The first stream is for labeled images of the seen classes and the second stream for unlabeled images; the two streams are designed to relate the target classes to the source ones. We then propose a coarse-to-fine similarity mining module to transfer the similarities of the labeled source data to the target data. We begin with the coarse stage, which finds the informative images of the novel classes, i.e., removes the hard samples and noisy images. We formulate this as a binary classification problem and propose a new layer, called the cross-images selection layer, to greedily select the most informative target data. With the images found in the coarse stage, we devise a simple and effective strategy for transferring the similarities from the seen classes to the target classes by utilizing word representations. Finally, a loss function is proposed to capture the similarities among the images, and a hash model is learned to encode all images into binary codes.
The main contributions of this work are threefold.
We propose a deep transductive zero-shot hashing framework to solve the projection domain shift problem, which learns the hash functions for both the seen and novel classes.
We propose a simple yet efficient coarse-to-fine similarity mining method for transferring the knowledge from the seen classes to the novel classes.
We conduct extensive evaluations on several benchmark datasets. The empirical results demonstrate that the proposed method achieves superior performance to the baselines.
2 Related Work
Owing to rapidly increasing data volumes, hashing has become a popular method for nearest-neighbor search. Many methods have been proposed, including data-independent hashing  and data-dependent hashing [3, 19, 11, 32, 25, 31, 18, 22, 10].
Among the supervised methods, learning hash codes with deep frameworks, e.g., CNN-based methods , has emerged as one of the leading approaches. Lin et al.  proposed to learn the hash codes and image representations in a point-wise manner, which is suitable for large-scale training datasets. Zhang et al.  presented a novel regularization term for learning deep hash functions. Wang et al.  proposed the deep multimodal hashing with orthogonal regularization (DMHOR) method for multi-modal data. Zhao et al.  proposed a deep semantic ranking-based method for learning hash functions that preserve multi-level semantic similarity between multi-label images. Zhuang et al.  proposed a fast deep network for triplet supervised hashing. Liu et al.  proposed deep supervised hashing (DSH) to learn compact similarity-preserving binary codes for large bodies of image data. Zhang et al.  proposed an efficient training algorithm for very deep neural networks based on the alternating direction method of multipliers. Liu et al.  proposed deep sketch hashing (DSH) for free-hand sketch-based image retrieval. Mandal et al.  and Jiang et al.  presented deep hashing frameworks for cross-modal retrieval.
Despite their success, supervised deep hashing methods require a large amount of label information. Recently, Lin et al.  proposed an unsupervised deep learning approach to learn compact binary codes, enforcing three criteria on the codes: 1) minimal quantization loss, 2) evenly distributed codes, and 3) uncorrelated bits. Wu et al.  presented an end-to-end unsupervised deep video hashing (UDVH) method, which integrates feature clustering and feature binarization to preserve the neighborhood structure in the binary space, together with a smart rotation to balance the binary codes. Xia et al.  proposed a novel unsupervised heterogeneous deep hashing framework, in which an auto-encoder and a Restricted Boltzmann Machine (RBM) are utilized to learn the binary codes. Venkateswara et al.  introduced a new dataset called Office-Home for evaluating domain adaptation algorithms.
However, existing unsupervised hashing methods do not leverage useful knowledge from related datasets. Yang et al.  proposed zero-shot hashing for encoding images of unseen classes into binary codes. In their work, an NLP model first transforms data labels into a semantic embedding space, in which the relationships among the seen and unseen classes can be well characterized. The embedding space is then rotated for better alignment with the visual feature space. Finally, hash functions are learned that transform the visual feature space into the embedding space.
Since the underlying data distributions of the seen and novel classes differ, projection functions learned on the seen classes without any adaptation to the novel classes may cause a data bias problem. Fu et al.  proposed transductive multi-view embedding to solve the projection domain shift problem. Guo et al.  proposed transductive zero-shot recognition via jointly learning a shared model space for transferring knowledge between classes. The most similar work is , which formulates zero-shot hashing as a domain adaptation problem: given the features of a mini-batch of images belonging to the unseen classes, it updates the transformation matrix learned from the seen classes at each iteration.
Despite their success, most existing zero-shot hashing methods first represent each input image by fixed visual descriptors (e.g., extracted from pre-trained deep models ), followed by separate projection and quantization steps to encode the representation into a binary code. Such fixed visual features may not be optimal; it is preferable to learn visual features that sufficiently preserve the image similarities. In this paper, we propose a deep framework for learning a better common representation.
3 Transductive Zero-Shot Hashing
In this section, we describe the deep convolutional network architecture designed for transductive zero-shot hashing (TZSH).
We first introduce notation to formalize the TZSH setting. There is a labeled source dataset S = {(x_i, y_i)}, where x_i is an image and y_i is the class name/label of the i-th image. We also have a large set of unlabeled data U, which includes images from both the seen classes C_s and the novel classes C_t but carries no annotations; the novel classes are disjoint from the seen ones, i.e., C_s ∩ C_t = ∅. The goal of TZSH is to learn a deep hash model from the labeled source dataset S and the unlabeled dataset U such that the similarities among both the novel and seen images are preserved.
As shown in Fig. 1, the proposed architecture mainly contains two parts: 1) a shared deep network for learning the common semantic image representations for both the source and target datasets, and 2) a coarse-to-fine similarity mining module for finding the similarities of the target images with the help of the semantic word spaces. By the word representations of class names, we transfer the similarities from the seen classes to the novel ones by the proposed cross-images selection layer in a greedy manner. Finally, we can easily construct the loss function by using the above found similarities.
3.1 Common Semantic Representation via a Shared Two-stream Network
To encode both the source and target images into the same semantic space, we introduce a two-stream architecture with two inputs. The first stream operates on the labelled source data and the second on the unlabelled data, with mini-batch sizes n_1 and n_2, respectively. These images pass through the deep network of stacked convolutional and fully-connected layers and are encoded into a common representation space. Note that the weights of the corresponding layers are shared between the two streams and are trained jointly. The output common semantic representation of an input x is denoted f(x), where f(·) is the proposed deep network.
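The effect of weight sharing can be sketched in a few lines (a toy NumPy illustration with assumed dimensions; the real model is a stacked convolutional network, not this single projection):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4096, 128))  # shared weights (toy sizes, for illustration)

def encode(x):
    """Shared encoder applied to either stream; because the weights W are
    identical for labeled and unlabeled inputs, both streams map into the
    same common semantic space."""
    return np.maximum(x @ W, 0.0)  # a single ReLU projection as a stand-in

labeled = rng.normal(size=(8, 4096))     # mini-batch from the source stream
unlabeled = rng.normal(size=(16, 4096))  # mini-batch from the unlabeled stream
f_src, f_tgt = encode(labeled), encode(unlabeled)
```

During training, gradients from both streams accumulate into the same W, which is what couples the source and target representations.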
3.2 Coarse-to-Fine Similarity Mining
There are no annotations for the unlabelled images, which makes constructing the hash functions challenging. In this subsection, with the learned common semantic space, we aim to detect the similarities among the unlabelled images using the proposed coarse-to-fine approach.
We use two stages to find the similarities among the target data: 1) in the coarse stage, we greedily select the most informative images; then 2) in the fine stage, we greedily select one image from the coarse set for each target class.
3.2.1 Coarse Similarity Mining
The main idea behind our method is that there is commonly a difference between the distributions of images from the novel classes and images from the seen classes. Thus, we can formulate the coarse stage as a binary classification problem: images of the seen classes are negative samples and images of the novel classes are positive samples. To train the model, we can use the labeled images (all from the seen classes) as negative samples. Since no annotations are available, a greedy method is applied and the cross-images selection layer is proposed for finding the positive samples.
More specifically, we add a 2-dimensional fully-connected layer after the semantic representations of the two streams, respectively, yielding probability outputs in which each row gives the probabilities of the corresponding image coming from the seen classes or from the novel classes. We propose a cross-images selection layer in our network to handle the target data without annotations. It greedily chooses the K most representative images, i.e., those with the largest probability scores of coming from the novel classes: we divide the unlabelled mini-batch into K groups of equal length and select from each group the image with the highest score. Fig. 2 illustrates the proposed greedy selection procedure.
If an image has a relatively high (low) score, it likely (unlikely) belongs to the novel classes; the highest-scoring images are the most likely to be novel. Thus, the selected images are the most informative ones in the mini-batch, and we can use them as positive samples.
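The per-group greedy selection described above can be sketched as follows (a minimal NumPy illustration; the group count and the toy scores are stand-ins for the layer's real inputs):

```python
import numpy as np

def cross_images_select(scores, num_groups):
    """Greedily pick one image per group: the one with the highest
    probability of coming from a novel class."""
    batch = scores.shape[0]
    assert batch % num_groups == 0, "mini-batch must split into equal groups"
    group_len = batch // num_groups
    groups = scores.reshape(num_groups, group_len)
    # index of the best image inside each group ...
    local = groups.argmax(axis=1)
    # ... converted back to indices in the full mini-batch
    return np.arange(num_groups) * group_len + local

# toy novel-class probability scores for a mini-batch of 8 images
scores = np.array([0.1, 0.9, 0.3, 0.2, 0.8, 0.4, 0.6, 0.5])
print(cross_images_select(scores, num_groups=2))  # -> [1 4]
```

With a mini-batch of 8 and 2 groups, the layer keeps images 1 and 4, the per-group score maxima.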
Learning the parameters then reduces to a traditional binary classification problem, which we optimize with the softmax loss, a widely used loss function for classification in neural networks. The coarse-stage objective is the softmax cross-entropy over the labeled (negative) images and the selected (positive) images.
Discussions. First, we show that the proposed deep architecture is able to detect novel images. Suppose an image is not from the novel classes, i.e., it belongs to a seen class. In this case, the first stream will assign a low probability score to this image, because images of the seen classes are regarded as negative samples and the network's parameters are shared. Thus, the probability score is suppressed in the next iteration, making this image unlikely to be chosen again.
Second, the cross-images selection layer is attractive because it allows end-to-end training. Specifically, given the gradient of the objective (2) with respect to the selected outputs, we simply route each gradient back to the position its selected image came from and set the gradients of all other images to zero; that is, during back-propagation the informative images receive their gradients and all other images receive zero in the cross-images selection layer.
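The backward pass is then a simple scatter (again a NumPy sketch; in practice this happens inside the framework's backpropagation machinery):

```python
import numpy as np

def selection_backward(grad_selected, selected_idx, batch_size):
    """Route gradients back through the cross-images selection layer:
    selected images receive their gradients, all others get zero."""
    grad_full = np.zeros(batch_size)
    grad_full[selected_idx] = grad_selected
    return grad_full

# only positions 1 and 4 (the selected images) carry gradient
g = selection_backward(np.array([0.5, -0.2]), np.array([1, 4]), 8)
```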
3.2.2 Fine Similarity Mining
With the images found in the unlabelled data, we need a fine stage to determine the similarities among them. Although annotations are unavailable, the similarities among the class names are, e.g., via the existing word2vec  model. Specifically, for the i-th class name we denote by v_i its word vector. The similarity between the i-th and j-th classes is then defined as the cosine similarity cos(v_i, v_j) between the two word vectors.
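The class-name similarity is just a cosine between word vectors, which can be computed as follows (NumPy sketch; the toy vectors below are illustrative, not real word2vec embeddings):

```python
import numpy as np

def class_similarity(v_i, v_j):
    """Cosine similarity between the word vectors of two class names."""
    return float(np.dot(v_i, v_j) /
                 (np.linalg.norm(v_i) * np.linalg.norm(v_j)))

# toy 2-d "word vectors" for two hypothetical class names
v_a, v_b = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(round(class_similarity(v_a, v_b), 4))  # 0.7071
```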
Inspired by this, we also use a greedy approach to detect the similarities among the target images; Fig. 3 shows the framework. We project the source classes onto the target classes. More specifically, we add another fully-connected layer, with one output per target class, to both streams. The images go through the deep neural network, which outputs probability scores over the target classes. We again use the cross-images selection layer to greedily select the images with the largest probabilities of coming from the novel classes.
For training the parameters, we again use the softmax loss. For the source data, we first calculate the similarities between each label and all target classes according to (3). These scores are then normalized to sum to one, yielding a probability vector that describes how closely the image resembles each novel class; this vector is used as the soft label for the image. For the target data, each selected image is assigned to its corresponding class. The softmax loss is then defined over these labels.
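The soft-label construction can be sketched as follows (a NumPy illustration under our reading of the text; the normalization simply rescales the class-name similarities so they sum to one, and the loss is the standard softmax cross-entropy against that soft vector):

```python
import numpy as np

def soft_labels(sims):
    """Normalize a source image's similarities to the target classes into a
    probability vector (the soft label used by the fine-stage softmax loss)."""
    s = np.asarray(sims, dtype=float)
    return s / s.sum()

def soft_cross_entropy(logits, target):
    """Softmax cross-entropy against a soft (non one-hot) label vector."""
    z = logits - logits.max()              # stabilized log-softmax
    log_p = z - np.log(np.exp(z).sum())
    return float(-(target * log_p).sum())

t = soft_labels([0.6, 0.3, 0.1])  # similarities to 3 hypothetical target classes
loss = soft_cross_entropy(np.array([2.0, 0.5, -1.0]), t)
```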
In the proposed fine similarity mining module, the performance depends on the similarities between the seen and novel classes. If a novel class is highly similar to some seen class, our method works well; if a novel class is unrelated to all seen classes, our method may fail.
3.3 Similarity-Preserving Loss
As we aim to learn q-bit binary codes, we use a fully-connected layer to compact each representation into a q-dimensional vector. Through the coarse-to-fine module, we obtain the deep binary features of the source data and of the most representative target data.
The Hamming distance between similar images should be small and that between dissimilar images should be large. For the source data, the similarities can be obtained from the class labels: s_ij = 1 if y_i = y_j, and s_ij = 0 otherwise. If the i-th image comes from the source classes and the j-th image from the target classes, the similarity is s_ij = 0. For the similarities among the target images, we use both the Hamming distance and the detected similarities as supervision, so as not to introduce too many wrong annotations. Specifically, if the indices of the largest values in the two images' target-class probability vectors agree and the Hamming distance between their codes is small, the two images are deemed similar, i.e., s_ij = 1; when the indices differ and the Hamming distance is large, they are deemed dissimilar, i.e., s_ij = 0. The overall loss function combines these pairwise similarity-preserving terms.
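The pairwise labeling rule for target pairs can be sketched as follows (a NumPy illustration; the thresholds d_low and d_high are hypothetical hyper-parameters we introduce for the sketch, and ambiguous pairs are simply skipped):

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.sum(a != b))

def target_pair_label(p_i, p_j, b_i, b_j, d_low, d_high):
    """Assign a pairwise label to two selected target images:
    1 (similar) if their predicted novel classes agree and their codes are
    close; 0 (dissimilar) if the classes differ and the codes are far;
    None when the evidence is ambiguous, so the pair is not used."""
    same_class = int(np.argmax(p_i)) == int(np.argmax(p_j))
    d = hamming(b_i, b_j)
    if same_class and d <= d_low:
        return 1
    if (not same_class) and d >= d_high:
        return 0
    return None

p_i, p_j = np.array([0.1, 0.9]), np.array([0.2, 0.8])      # fine-stage scores
b_i, b_j = np.array([1, 0, 1, 0]), np.array([1, 0, 1, 1])  # toy 4-bit codes
print(target_pair_label(p_i, p_j, b_i, b_j, d_low=1, d_high=3))  # 1
```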
4 Experiments
In this section, we evaluate and compare the performance of the proposed method with several state-of-the-art algorithms.
4.1 Datasets and Experimental Settings
We conduct extensive evaluations of the proposed Transductive Zero-Shot Hashing (TZSH) on three benchmark datasets.
Animals with Attributes (AwA) (http://attributes.kyb.tuebingen.mpg.de/) consists of 30,475 images of 50 animal classes; 85 numeric attribute values are provided for each class.
CIFAR-10 (http://www.cs.toronto.edu/~kriz/cifar.html) consists of 60,000 images in 10 classes, with 6,000 images per class. It is a labeled subset of the 80 Million Tiny Images dataset.
In each dataset, we resize all images to a fixed size. We randomly select 2,500 images from the target classes as the test set, and the remaining images form the retrieval database. In the retrieval database, 10,000 images from the seen classes are randomly chosen as the labeled set, and the other images are used as the unlabeled set.
We evaluate the performance with two popular metrics: Mean Average Precision (MAP) and the precision within Hamming distance 2.
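For reference, MAP can be computed as follows (a standard NumPy sketch of the metric, not tied to our implementation):

```python
import numpy as np

def average_precision(relevant):
    """AP for one query, given the binary relevance of its ranked list."""
    rel = np.asarray(relevant, dtype=float)
    if rel.sum() == 0:
        return 0.0
    hits = np.cumsum(rel)
    prec = hits / np.arange(1, len(rel) + 1)   # precision at each rank
    return float((prec * rel).sum() / rel.sum())

def mean_average_precision(all_relevant):
    """MAP: mean of per-query average precisions."""
    return float(np.mean([average_precision(r) for r in all_relevant]))

# toy ranked relevance lists for two queries
print(round(mean_average_precision([[1, 0, 1], [0, 1]]), 4))
```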
We implement the proposed method in the open-source Caffe  framework. The mini-batch sizes of the two streams are fixed, and the remaining hyper-parameter is set to 32 for all experiments. We use AlexNet  as our basic network. For all deep learning-based methods, the weights of the layers are initialized with the pre-trained AlexNet model (http://dl.caffe.berkeleyvision.org/bvlc_alexnet.caffemodel) unless noted otherwise. IMH , COSDISH , SDH , ZSH , One-Stage Hashing  and Domain Adaptation Zero-shot Hashing (DA-ZSH)  are selected as baselines.
4.2 Results on ImageNet
ImageNet is a large dataset consisting of 1.2 million color images from 1,000 categories. For a fair comparison, we follow the setting of . Precisely, 100 classes that have corresponding semantic word vectors learned from the Wikipedia text corpus are selected. We randomly select 10 categories as the target classes and the remaining 90 categories as the seen classes.
The pre-trained AlexNet model uses all 1,000 classes in training and is therefore unsuitable for the zero-shot hashing setting. Thus, we train a new AlexNet model using only the remaining 900 classes. The weights of our network are initialized with this new model, which is also used to extract features for the images.
We compare the proposed method with several state-of-the-art methods, as shown in Fig. 4. Note that the results of the other methods are cited directly from . As can be seen, the proposed TZSH outperforms the baselines on the ImageNet dataset. Note that ZSH is a recently proposed method that achieves excellent performance on this dataset; even so, our method still performs better.
4.3 Results on AwA Dataset
The AwA dataset contains 50 animal species. For a fair comparison, we follow the setting of the most similar work, domain adaptation zero-shot hashing (DA-ZSH) , which also uses the unlabeled data of the novel classes. Specifically, 10 classes are selected as the target classes and the remaining 40 classes as the seen classes. The 85 numeric attribute values of each class are used as semantic vectors.
We compare with DA-ZSH, which also uses features extracted from the deep network. Again, our method yields the highest accuracy and beats all the baselines. Note that we train our model only on the labelled images of the seen classes and the unlabelled data, which indicates that our model can perform well on the novel classes even without annotations.
4.4 Results on CIFAR-10
This dataset consists of color images from 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. For a fair comparison, we follow the setting of the most similar work, domain adaptation zero-shot hashing (DA-ZSH) . Each class is enriched with a real-valued 300-dimensional semantic word vector obtained from the pre-trained word2vec model . Two classes are randomly selected as target classes and the remaining eight as seen classes.
Table 2 shows the performance of all compared approaches. We take ship-truck, automobile-deer, dog-truck, and cat-dog as target classes for zero-shot hashing, respectively. Our method achieves very high accuracy compared to the baselines. For instance, although DA-ZSH achieves excellent performance on this dataset, the average MAP of TZSH at 16 bits is 0.5502, while that of the second-best algorithm, DA-ZSH, is 0.4126.
The main reason for the good performance is that TZSH can find the similarities of the target images via the proposed similarity mining module, so the similarities of the target classes can be incorporated. However, cat and dog are dissimilar to the other eight classes, which makes it hard to transfer the similarities of the seen classes to the novel ones; it is therefore not surprising that the results on cat-dog are poor.
To better understand our proposed method, we do further analysis in the following subsections.
4.4.1 Effect of Seen Category Ratio
In this set of experiments, we compare the performance of our method with respect to different numbers of seen classes, varying from 2 to 8. (We do not report results for one or nine seen classes: hash functions cannot be constructed from a single seen class, and with nine seen classes all annotations are known in the transductive setting.)
Since DA-ZSH  is the most similar work and performs very well, we choose it as the baseline. Fig. 5 shows the comparison results, from which it can be seen that our method performs best for all numbers of seen classes. For example, the MAP of our method is 0.6908, compared to 0.5325 for DA-ZSH, when the number of novel classes is two. We also observe that our method achieves a significant improvement over the baseline when the number of seen classes is larger than the number of novel ones. This is because it is hard to transfer the similarities of the seen classes to the novel ones when the number of seen classes is small.
4.4.2 Effect of the Unlabeled Size
In training, we have an unlabeled set. In this set of experiments, we explore the effect of the unlabeled set's size on MAP. We take ship-truck as the target classes and the remaining 8 classes as seen classes, and vary the size of the unlabeled set from 10,000 to 40,000. Table 3 shows the results: our method performs well across all sizes, because it can detect the similarities among the ship and truck images.
5 Conclusion
In this paper, we proposed a deep-network-based transductive zero-shot hashing method for image retrieval. In the proposed deep architecture, we designed coarse-to-fine similarity mining to find the similarities of the novel classes. A deep binary classification network finds the most informative images of the target classes in the unlabelled data; the similarities of the seen classes are then transferred to the found images with the help of word representations. Based on the found similarities, we finally proposed a ranking loss function for preserving the similarities. Empirical evaluations on three datasets showed that the proposed method significantly outperforms the state-of-the-art methods.
In future work, we plan to study zero-shot hashing for multi-label image retrieval, in which an image may contain objects of multiple categories.
-  Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Transductive multi-view zero-shot learning. TPAMI, 37(11):2332–2345, 2015.
-  A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518–529, 1999.
-  Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, pages 817–824, 2011.
-  J. Guo and J. Li. Cnn based hashing for image retrieval. arXiv preprint arXiv:1509.01354, 2015.
-  Y. Guo, G. Ding, X. Jin, and J. Wang. Transductive zero-shot recognition via shared model space learning. In AAAI, volume 3, page 8, 2016.
-  Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org, 2013.
-  Q.-Y. Jiang and W.-J. Li. Deep cross-modal hashing. CVPR, 2017.
-  W.-C. Kang, W.-J. Li, and Z.-H. Zhou. Column sampling based discrete supervised hashing. In AAAI, pages 1230–1236, 2016.
-  A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
-  B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, pages 1042–1050, 2009.
-  B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In ICCV, pages 2130–2137, 2009.
-  H. Lai, Y. Pan, Y. Liu, and S. Yan. Simultaneous feature learning and hash coding with deep neural networks. In CVPR, pages 3270–3278, 2015.
-  W. Li. Feature learning based deep supervised hashing with pairwise labels. In IJCAI, pages 3485–3492, 2016.
-  K. Lin, J. Lu, C.-S. Chen, and J. Zhou. Learning compact binary descriptors with unsupervised deep neural networks. In CVPR, 2016.
-  K. Lin, H.-F. Yang, J.-H. Hsiao, and C.-S. Chen. Deep learning of binary hash codes for fast image retrieval. In CVPR, pages 27–35, 2015.
-  H. Liu, R. Wang, S. Shan, and X. Chen. Deep supervised hashing for fast image retrieval. In CVPR, pages 2064–2072, 2016.
-  L. Liu, F. Shen, Y. Shen, X. Liu, and L. Shao. Deep sketch hashing: Fast free-hand sketch-based image retrieval. CVPR, 2017.
-  W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In CVPR, pages 2074–2081, 2012.
-  W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In ICML, pages 1–8, 2011.
-  D. Mandal, K. Chaudhury, and S. Biswas. Generalized semantic preserving hashing for n-label cross-modal retrieval. CVPR, 2017.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
-  M. Norouzi and D. M. Blei. Minimal loss hashing for compact binary codes. In ICML, pages 353–360, 2011.
-  S. Pachori, A. Deshpande, and S. Raman. Hashing in the zero shot framework via domain adaptation. arXiv preprint arXiv:1702.01933, 2017.
-  S. J. Pan and Q. Yang. A survey on transfer learning. TKDE, 22(10):1345–1359, 2010.
-  R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In AISTATS, pages 412–419, 2007.
-  F. Shen, C. Shen, W. Liu, and H. Tao Shen. Supervised discrete hashing. In CVPR, pages 37–45, 2015.
-  F. Shen, C. Shen, Q. Shi, A. Van Den Hengel, and Z. Tang. Inductive hashing on manifolds. In CVPR, pages 1562–1569, 2013.
-  R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, pages 935–943, 2013.
-  H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised domain adaptation. CVPR, 2017.
-  D. Wang, P. Cui, M. Ou, and W. Zhu. Deep multimodal hashing with orthogonal regularization. In IJCAI, pages 2291–2297, 2015.
-  J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for scalable image retrieval. In CVPR, pages 3424–3431, 2010.
-  Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, pages 1753–1760, 2008.
-  G. Wu, L. Liu, Y. Guo, G. Ding, J. Han, J. Shen, and L. Shao. Unsupervised deep video hashing with balanced rotation. IJCAI, 2017.
-  Z. Xia, X. Feng, J. Peng, and A. Hadid. Unsupervised deep hashing for large-scale visual search. In IPTA, pages 1–5. IEEE, 2016.
-  Y. Yang, Y. Luo, W. Chen, F. Shen, J. Shao, and H. T. Shen. Zero-shot hashing via transferring supervised knowledge. In MM, pages 1286–1295, 2016.
-  R. Zhang, L. Lin, R. Zhang, W. Zuo, and L. Zhang. Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. TIP, 24(12):4766–4779, 2015.
-  Z. Zhang, Y. Chen, and V. Saligrama. Efficient training of very deep neural networks for supervised hashing. In CVPR, pages 1487–1495, 2016.
-  F. Zhao, Y. Huang, L. Wang, and T. Tan. Deep semantic ranking based hashing for multi-label image retrieval. arXiv preprint arXiv:1501.06272, 2015.
-  B. Zhuang, G. Lin, C. Shen, and I. Reid. Fast training of triplet-based deep binary embedding networks. In CVPR, pages 5955–5964, 2016.