As image databases grow ever larger, the importance of performing efficient nearest neighbor (NN) based similarity search has increased. One effective paradigm for this is learning to hash, whose objective is to learn a mapping from images to binary hash codes, such that a query image's code is close to the codes of semantically similar images in a pre-computed database (when measured e.g. in terms of Hamming distance). In addition to offering fast exact NN search, e.g. via Multi-Index Hashing , the binary hash codes also result in considerably lower storage requirements. A good review on learning to hash is given by Wang et al. .
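As a concrete illustration of why binary codes make queries cheap, the Hamming distance between two codes packed into integers is a single XOR followed by a popcount (a minimal sketch; packing codes into Python ints is an assumption for illustration):

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two equal-length binary hash codes
    packed into Python ints: XOR the codes, then count the set bits."""
    return bin(a ^ b).count("1")

# Two 8-bit codes differing in exactly two bit positions
q  = 0b1011_0001
db = 0b1010_0101
```

Exact-NN schemes such as Multi-Index Hashing build on this primitive by splitting each code into substrings and probing one hash table per substring.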
Hashing methods for retrieval fall into two classes: data-independent methods, such as classic locality-sensitive hashing (LSH) , and data-dependent methods (either supervised or unsupervised), which learn hash codes tailored to the datasets in question and have shown considerably better performance. Deep learning based hashing methods have been proposed in both supervised and unsupervised settings.
Most supervised hashing techniques use only pairwise binary similarity, i.e. only images of the same class are considered in any way similar. However, such a measure is too crude to capture deeper semantic relationships in data. As illustrated in , models trained with pairwise similarities need not learn anything about inter-class relations: a cat-dog pair is assigned the same distance as a cat-airplane pair. We demonstrate and measure this effect in our results.
Another key issue of most deep hashing approaches is the separation into (i) learning a continuous embedding via backpropagation (ii) subsequent thresholding of these into hash codes. Hence the differentiable optimization and hashing objectives are not aligned; the model is not forced to learn exactly binary outputs, and information loss occurs when performing the thresholding operation.
Our main contributions are as follows:
We address the limits of pairwise binary similarity by using more nuanced similarity metrics, derived from semantic distances between labels or other supervisory signals, such as captions. Specifically, rather than directly using "image x has label y", we consider relations such as "image i is as similar to image j as label i is similar to label j". For a batch of B images, the target is then a B×B matrix whose elements are the semantic similarity distances between the corresponding labels. Hence we use information between all examples in the batch, instead of e.g. a triplet-based approach, which may require expensive mining schemes. This results in superior performance compared to binary similarity, with insignificant computational overhead.
We resolve the hash code binarization problem by introducing a novel loss function based on the differentiable Kozachenko-Leonenko estimator of the KL divergence . We minimize the KL divergence between a balanced, nearly binary (Beta) target distribution and the distribution of the continuous-valued network outputs; this is an estimate of the true KL divergence between the target distribution and the output distribution of the network. This regularizes network outputs towards binary values, with minimal information loss after thresholding. Additionally, the resulting hash codes maximally utilize the hash code space, since they are uniformly distributed; this is crucial in avoiding the trivial solution problem, as well as in guaranteeing efficient retrieval when using e.g. the Multi-Index Hashing approach .
We measure the performance of our methods by focusing on the mean average hierarchical precision (mAHP) , in addition to mAP and accuracy, metrics that have been shown to be highly unreliable for supervised retrieval. The mAHP measure takes into account the similarity between the query and the retrieval results, and is therefore a more reliable retrieval metric. Specifically, our experiments demonstrate that models trained only with a classification loss may indeed suffer from low mAHP while seemingly performing very well in terms of mAP. While mAHP is undoubtedly a better retrieval metric than mAP, the ultimate test of any retrieval model is how it performs on completely unseen data. We therefore also measure our method's performance in the Zero Shot Hashing (ZSH) setting , where a model is trained with the 1000-label ILSVRC2012 dataset and tested on the full ImageNet dataset, consisting solely of classes unseen during training.
We test our method and obtain state-of-the-art retrieval results in terms of mAHP on the CIFAR-100 and ImageNet (ILSVRC2012) datasets. We experiment with (i) using the WordNet hierarchy to define a non-binary semantic similarity metric for the labels; and (ii) sentence embeddings computed on the class labels' WordNet descriptions by a version of BERT fine-tuned on the MRPC corpus.
We extend this method to work with weakly-labeled captions in the form of Google’s Conceptual Captions (CC) dataset, where we generate sentence embeddings with BERT, and use these embeddings to define a distance matrix, enabling retrieval of images with nuanced semantic content beyond simple class labels. In addition, we test our methods in the ZSH setting with the full 21k-class ImageNet unseen label dataset, and demonstrate that our method is effective in a real world content retrieval setting.
2 Related work
In Unsupervised Semantic Deep Hashing (USDH, ), semantic relations are obtained by looking at embeddings from a VGG model pre-trained on ImageNet. The goal of the semantic loss here is simply to minimize the distance between binarized hash codes and their pre-trained embeddings, i.e. neighbors in hashing space are neighbors in pre-trained feature space. This is somewhat similar to our notion of semantic similarity, except that they use a pre-trained embedding instead of a labeled semantic hierarchy of relations. Some works [41, 40] consider class-wise deep hashing, in which a clustering-like operation is used to form a loss between samples from the same class and, in , across levels of the hierarchy. In our method, we do not require explicit targets to be learned by the network; only relative semantic distances need to be supplied as targets.
One prior work explored image retrieval (without hashing) using semantic hierarchies to design an embedding space, in a two-step process. First, they directly find embedding vectors of the class labels on a unit hypersphere, via a linear-algebra-based approach, such that the distances between these embeddings match the supplied hierarchical similarity. In a second stage, they train a standard CNN encoder to regress images towards these embedding vectors. We make use of hierarchical relational distances in a similar way to constrain our embeddings. We will, however, not regress towards any pre-learned fixed target embeddings, but instead require that the distances between the neural network outputs match the distances inferred from the target similarities. Also, unlike our work, they use only continuous representations and require an embedding dimension equal to the number of classes, whereas we learn compact binary hash codes of any dimension, enabling fast queries from huge databases.
In , embedding vectors on a unit hypersphere are also used. They use the Kozachenko-Leonenko nearest-neighbor-based estimator to maximize entropy on this hypersphere, encouraging uniformity of the network outputs. We use the same estimation method, but to minimize the Kullback-Leibler divergence between the (empirical) distribution of network outputs and a target distribution, which in our case is a near-binary one. Also, they consider only a binary similarity matrix based on nearest neighbors in the input space, using a triplet loss, and they do not consider hashing.
We also consider how semantic relations in language models can guide similarity in the image domain. Pre-trained word embeddings have found wide adoption in many NLP tasks ; however, their suitability for various transfer learning tasks depends heavily on the particular domain and training dataset. Therefore, considering combinations of different word embeddings ("meta-embeddings")  or sentence embeddings is a promising direction. In DeViSE , a language model and a visual model are combined in order to transfer knowledge from the textual domain to the visual domain, starting from pre-trained models in each domain and minimizing distances in embedding space between text and image representations. Rather than trying to directly learn a mapping from images to word embeddings, we can use our relaxed similarity distance matching objective to align the two domains.
3 Learning binary hash codes while preserving semantic similarity
Suppose we have a dataset of images x with corresponding targets y, which could be labels, attributes, captions etc. We wish to learn useful representations by backpropagation through a deep neural network f with weights θ, but such that the network outputs are binary valued, for use as hash codes in fast hash-based queries from massive databases. Simply thresholding the network output to generate binary vectors would not work, because the sign function is not differentiable. One proposed solution in  is to use a "soft sign" function with adjustable scaling. This would however guarantee neither uniformity nor that the actual values would be near binary.
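The "soft sign" relaxation mentioned above can be sketched as a scaled tanh; the particular scale β below is an assumption, and, as noted, this relaxation guarantees neither uniformity nor truly near-binary values:

```python
import math

def soft_sign(x: float, beta: float = 5.0) -> float:
    """Differentiable 'soft sign': a tanh whose scaling beta controls
    how sharply outputs are pushed towards +/-1 (beta is a free choice)."""
    return math.tanh(beta * x)
```

With a large β the function approximates sign(x) closely away from zero, but gradients vanish there, while values near x = 0 remain far from binary.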
Usefulness of the representation is in our case defined by semantic similarity: network outputs of semantically similar images should be close to each other when measured in Hamming distance. In the optimal case, there would be a continuous measure of semantic similarity based on the targets , which the network would learn to reproduce as accurately as possible in the Hamming distances between the network outputs. These are the two old problems in learning to hash, for which we propose two novel solutions, in sections 3.2 and 3.3.
3.1 Overall Methodology
We train a CNN with a bottleneck output layer that we can binarize to produce both database and query codes for image retrieval; see Figure 1 (a). A target distribution is used to constrain the continuous embedding to be approximately balanced and near-binary. Our total loss function is defined as
L_total = L_class + λ_sim L_sim + λ_KL L_KL,     (1)

where L_class is a standard classification cross-entropy loss in most experiments (replaced by a regression loss for the Conceptual Captions dataset), L_sim and L_KL refer to the similarity and KL divergence loss terms defined in the next sections, and λ_sim and λ_KL are hyperparameters, set so that the various loss contributions are scaled approximately equally.
3.2 Learning semantic similarity from data
We assume that it is possible to infer the semantic similarity between two images x_i, x_j from the corresponding targets y_i, y_j by a distance measure d(y_i, y_j). The most common approach is to use labels and define d = 0 when the labels are the same and d = 1 when they differ. In a worst-case scenario, using such a binary optimization criterion could lead the model to learn a unique hash code for each label, which would make the model effectively a classifier. Even worse, such a model would still seem very strong when measured in terms of the mAP (mean average precision) score, as explained in e.g. . We therefore choose to work with datasets from which we can infer multiple or continuous levels of similarity.
While it would be possible to first learn embeddings for the targets as was done in e.g.  and then train the network to regress to these embeddings, we will instead only use the inferred distances and let the network adapt to whichever outputs are most natural, while respecting the inferred distances.
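To make the setup concrete, here is a minimal sketch of building the per-minibatch target matrix from labels; the toy label distances below are invented for illustration, not taken from WordNet:

```python
# Hypothetical pairwise label distances (0 = same class, larger = less similar)
LABEL_DIST = {
    frozenset(["cat", "dog"]): 0.2,
    frozenset(["cat", "airplane"]): 0.9,
    frozenset(["dog", "airplane"]): 0.9,
}

def semantic_distance(a: str, b: str) -> float:
    """Symmetric semantic distance between two labels."""
    return 0.0 if a == b else LABEL_DIST[frozenset([a, b])]

def target_matrix(labels):
    """B x B matrix of inferred semantic distances for one minibatch."""
    B = len(labels)
    return [[semantic_distance(labels[i], labels[j]) for j in range(B)]
            for i in range(B)]
```

This matrix is recomputed per minibatch and serves as the target that the pairwise output distances are trained to match, up to scale.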
Suppose that the image and target pairs are denoted as (x_i, y_i) for minibatch index i, and denote the elements of the inferred similarity matrix between targets y_i and y_j as d_ij. We seek to match these distances with the Manhattan distances between the corresponding neural network outputs z_i = f(x_i; θ). We define a loss term

L_sim = (1/B^2) Σ_{i,j} w_ij | d_ij / ⟨d⟩ − |z_i − z_j|_1 / ⟨|z|_1⟩ |,     (2)

where the term w_ij is an additional weight, used to emphasize similar example pairs (e.g. cat-dog) more than distant ones (e.g. cat-airplane). We use such a weighting because we are mostly interested in retrieving images that are very similar to the query image. Furthermore, we are not as interested in learning the absolute distances between examples as the relative ones. Hence we have added the normalizing terms ⟨d⟩ and ⟨|z|_1⟩, the batch means of the target and output distances, which render the loss term scale invariant both in the network outputs and in the target distances. This enables the network to learn its outputs as flexibly as possible, while still respecting the relative distances inferred from the targets. We have observed both of these schemes to be helpful during learning. We use a slowly decaying form for the weight w_ij as a function of the target distance d_ij.
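A minimal sketch of this scale-invariant, weighted distance-matching loss; the exponential form of the pair weight and its rate alpha are assumptions, since the exact weight parameters are elided in this text:

```python
import math

def l1(u, v):
    """Manhattan (L1) distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def sim_loss(outputs, target_dist, alpha=1.0):
    """Match relative L1 distances between network outputs to relative
    target distances. Both distance sets are normalized by their batch
    means (scale invariance); pairs with small target distance get a
    larger weight so that near-duplicates dominate the loss."""
    B = len(outputs)
    pairs = [(i, j) for i in range(B) for j in range(i + 1, B)]
    out_d = {(i, j): l1(outputs[i], outputs[j]) for i, j in pairs}
    mean_out = sum(out_d.values()) / len(pairs)
    mean_tgt = sum(target_dist[i][j] for i, j in pairs) / len(pairs)
    loss = 0.0
    for i, j in pairs:
        w = math.exp(-alpha * target_dist[i][j])  # emphasize similar pairs
        loss += w * abs(out_d[(i, j)] / mean_out
                        - target_dist[i][j] / mean_tgt)
    return loss / len(pairs)
```

Because only normalized distances enter the loss, rescaling all outputs by a constant leaves the loss unchanged, which is exactly the flexibility described above.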
3.3 Minimizing empirical KL divergence
We are interested in extracting binary hash codes from the network outputs to facilitate fast queries from massive databases. A simple approach would be to round the outputs to binary values after training has been completed (or equivalently to take their sign). This would however lead to information loss, because the semantic similarity would not be preserved and because values near zero would be assigned essentially randomly to binary values. Additionally, there would be no guarantee that the hash code space is utilized efficiently. We therefore want to impose that the distribution of outputs is (nearly) binary and maximally uniform, while retaining differentiability of the loss function to allow learning by backpropagation. This way the information loss due to rounding is minimal, since the values are already close to binary, and images within a given class do not collapse onto a single shared hash code but are spread out. We could achieve this in principle by minimizing the Kullback-Leibler divergence between the distribution p of the network outputs and the target distribution q (with standard abuse of notation), defined as:
D_KL(p || q) = ∫ p(z) log [ p(z) / q(z) ] dz = H(p, q) − H(p),     (3)

where on the last line we have split the KL divergence into cross-entropy H(p, q) and entropy H(p) terms. The KL divergence attains its minimum value of zero if (and only if) p = q. An analytical solution of such an objective is of course intractable in general, so we resort instead to minimizing a k-nearest-neighbor Kozachenko-Leonenko empirical estimate of the KL divergence (see ). Our empirical loss term for achieving this is defined as
L_KL = (D/B) Σ_i log ν_i − (D/B) Σ_i log ρ_i + log( M/(B−1) ),     (4)

where ρ_i denotes the distance of output z_i to its nearest neighbor among the other outputs, ν_i denotes the distance of z_i to the nearest vector t_j in a sample {t_j} (of e.g. size M) drawn from the target distribution, D is the code dimension and B the batch size. We have again performed a split into empirical cross-entropy and entropy terms, corresponding to the analytical expressions in Eq. (3). It is now easy to see intuitively how such a loss function is minimized: the samples z_i drawn from the "model distribution" need to be as close as possible to the samples drawn from the target distribution, while at the same time ensuring that the z_i do not "mode collapse" onto a single t_j, by maximizing the inter-distances between the z_i, i.e. the empirical entropy of the outputs.
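The estimator can be sketched as follows for the 1-nearest-neighbor case, following the nearest-neighbor divergence estimator cited above; the Euclidean metric, box-uniform test distributions, and sample sizes here are assumptions for illustration:

```python
import math
import random

def euclid(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kl_estimate(Z, T):
    """1-NN Kozachenko-Leonenko-style estimate of KL(p || q), where
    Z ~ p plays the role of network outputs and T ~ q of target samples:
        (d/n) * sum_i [log nu_i - log rho_i] + log(m / (n - 1)),
    rho_i = NN distance of z_i within Z, nu_i = NN distance of z_i to T."""
    n, m, d = len(Z), len(T), len(Z[0])
    total = 0.0
    for i, z in enumerate(Z):
        rho = min(euclid(z, Z[j]) for j in range(n) if j != i)  # within-sample NN
        nu = min(euclid(z, t) for t in T)                       # cross-sample NN
        total += math.log(nu) - math.log(rho)
    return (d / n) * total + math.log(m / (n - 1))

random.seed(0)
box = lambda lo: [[random.uniform(lo, lo + 1) for _ in range(2)]
                  for _ in range(300)]
same = kl_estimate(box(0.0), box(0.0))     # p ~ q: estimate near zero
shifted = kl_estimate(box(2.0), box(0.0))  # disjoint supports: large estimate
```

Everything here is built from pairwise distances, so the same computation is differentiable with respect to the z_i when implemented in an autograd framework.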
3.4 Distribution matching via the Kozachenko-Leonenko KL divergence estimator
Our novel KL-based loss term has been used here to aid hash code binarization, but it could impose an arbitrary target distribution. We give a more intuitive exposition of its function in Figure 1 (b), which represents graphically how the empirical Kozachenko-Leonenko estimator, built from k-nearest-neighbor distances, forms an estimate of the Kullback-Leibler divergence between an observed distribution and a target distribution. Effectively, it ensures that the mean within-distribution (observed–observed) nearest neighbor distance of each sample is similar to the closest across-distribution (observed–target) distance.
4 Experiments

We implement our method in PyTorch with mixed precision training distributed across up to 64 NVIDIA Tesla V100 GPUs. We study ablations of models trained with the different losses in Eq. 1, denoted as "SIM", "KL" and "CLASS", and combinations thereof, across three datasets.
For CIFAR-100 (Section 4.1) and ImageNet (Section 4.2), to define the label distances we use NLTK's  implementation of the Wu-Palmer similarity (WUP) , which scores two class labels by the depth of their lowest common ancestor in the WordNet hierarchy, relative to the depths of the labels themselves. For ImageNet, we also consider sentence embeddings computed on the class labels' WordNet descriptions by a version of BERT fine-tuned on the MRPC corpus. For all experiments we include comparisons with available prior work that reported mAHP in their results.
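For intuition, the Wu-Palmer score can be computed on a toy hierarchy as follows; the mini-taxonomy here is invented for illustration, while the paper queries the real WordNet graph via NLTK:

```python
# Hypothetical mini-taxonomy: child -> parent, rooted at "entity".
PARENT = {"cat": "mammal", "dog": "mammal", "mammal": "animal",
          "animal": "entity", "airplane": "vehicle", "vehicle": "entity"}

def path_to_root(node):
    """Chain of ancestors from a node up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def wup(a, b):
    """Wu-Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b)),
    where LCS is the lowest common subsumer and the root has depth 1."""
    pa, pb = path_to_root(a), path_to_root(b)
    depth = lambda node: len(path_to_root(node))
    lcs = next(n for n in pa if n in pb)  # first shared ancestor of a
    return 2 * depth(lcs) / (depth(a) + depth(b))
```

Here wup("cat", "dog") = 0.75 exceeds wup("cat", "airplane"), since cat and dog share the deep subsumer "mammal" while cat and airplane meet only at the root; one minus such a similarity can serve as the label distance d.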
In Section 4.3, we train a model on the Conceptual Captions dataset, also using the BERT embeddings of the images’ captions as a reference similarity metric. We also report in Section 4.4 some results in a Zero Shot Hashing setting when retrieving from a dataset with totally unseen classes, demonstrating the generalization capabilities of our method.
As discussed in the introduction, using mAP to measure retrieval performance is imprecise: a perfect classifier can yield a perfect mAP score while having poorly distributed hash codes and no notion of similarity between classes. We therefore focus on the mAHP score, and show that a network trained with a classification loss only attains a high mAP but a poor mAHP score.
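To illustrate what the two metrics reward, a simplified hierarchical precision@k replaces the 0/1 class-match relevance of ordinary precision with a graded query-result similarity, normalized by the best achievable top-k; this is a sketch of the idea, not the exact mAHP definition from the cited work:

```python
def hier_precision_at_k(result_sims, all_sims, k):
    """Simplified hierarchical precision@k: the total semantic similarity
    of the returned top-k results, divided by the total similarity of the
    best possible top-k (all_sims = similarities of every database item
    to the query)."""
    best = sorted(all_sims, reverse=True)[:k]
    return sum(result_sims[:k]) / sum(best)
```

Under such a measure, returning a dog for a cat query scores far better than returning an airplane, whereas plain precision (and hence mAP) treats both mistakes identically.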
4.1 CIFAR-100

| Code length | mAHP | mAP | Class acc. |
| Method | mAHP | mAHP bin | mAP | Class acc. |
|---|---|---|---|---|
| Centre Loss  | 0.6815 | - | 0.4153 | 75.18% |
| Label Embedding  | 0.7950 | - | 0.6202 | 76.96% |
We used the ResNet-110w model architecture as in , where the top fully connected layer is replaced to return embeddings of the desired hash length. We also added a small classification head, which is detached from the rest of the network when classification is not used as an additional task. We used the Adam optimizer and trained the model for 500 epochs on a single V100 GPU, with weight decay and a batch size of 512.
We see in Table 2 how our method improves considerably over previous comparable methods in both mAHP and mAP, while using only 64-dimensional binary-valued codes. Note especially how the binary mAHP does not drop compared to the float mAHP for SIM-KL (in fact it increases within error margins), whereas there is a noticeable drop without the KL loss. There is a drop for SIM-KL-CLASS, which could be due to the additional classification loss hampering the effect of the KL loss. Note that when the classification loss is absent, the codes (network outputs) are used as fixed representations for classification, i.e. gradients are not propagated through the main hashing network.
We have also provided a study on the effect of the code length (network output dimension) on mAHP and mAP scores, validating that longer codes are in general better for retrieval, but also that the difference from 64 to 128 dimensions is not substantial, at least for CIFAR-100.
4.2 ImageNet

| Method | mAHP | mAHP bin | mAP | Class acc. | Bin. entropy |
|---|---|---|---|---|---|
| Centre Loss  | 0.4094 | - | 0.1285 | 70.05% | - |
| Label Embedding  | 0.4769 | - | 0.2683 | 70.94% | - |
|  (1000-d float) | 0.7902 | - | 0.3037 | 48.97% | - |
|  (1000-d float) | 0.8242 | - | 0.4508 | 69.18% | - |
| Ours: trivial one-hot solution | 0.4389 | - | 0.7547 | 75.40% | 16.31 |
| Ours: WUP, 64-d codes: | | | | | |
| Ours: BERT, 64-d codes: | | | | | |
We also used the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 dataset. We used ResNet50  with slight modifications as in , where the output dimension now equals the hash code length. We again added a small classification head, detached from the rest of the network when not training with the classification loss. We use a standard distributed ImageNet training scheme, using SGD with Nesterov momentum and a learning rate scaled by the number of GPUs. We conducted all training with 32 GPUs for a total of 320 epochs, decaying the learning rate every 100 epochs, with weight decay and mixed precision training. We also use sentence embeddings computed on the class labels' WordNet descriptions by a version of BERT fine-tuned on the MRPC corpus.
Results, together with an ablation study, are reported in Table 3. Perhaps the most important entry in the table is the "Trivial solution", where the output code is just the 1000-dimensional class prediction vector. It was already observed in  that such trivial solutions seem to be excellent retrieval models when measured in terms of mAP. We see however that the mAHP score is very low, as, obviously, is the binary entropy, given that there are only 1000 unique, equidistant "hash codes". Therefore the retrieval mAP score is not a good retrieval metric in the supervised setting. The best model is SIM-KL-CLASS with BERT embeddings. Note especially how adding the KL loss to a classification model results in only a small difference between float and binary mAHP. Also, adding a classification loss improves mAP and classification accuracy, but does not have a substantial effect on mAHP. We note in particular that the BERT sentence embeddings are qualitatively much better than the WUP similarity in terms of mAP and accuracy. Note however that the mAHP scores between WUP and BERT are not comparable, since mAHP values depend on the distance values.
We also report the binary entropy as a measure of the diversity of the binary hash codes, estimated by the Kozachenko-Leonenko estimator of entropy (the entropy term in Eq. 4). The value of 16.4 for the classification-only trained networks is especially low, corresponding to only around 1000 unique but uniformly distributed hash codes (although in reality there are probably more than 1000 dense clusters of codes). Completely uniformly sampled codes in 64 dimensions would amount to an entropy of 19.93, which is lower than the highest values in the table. This is simply because the network learns not to cluster the codes close to each other, whereas samples drawn from a truly uniform distribution can be arbitrarily close to each other.
| @250 | @250 bin | Tau float | Tau bin | Entropy |
4.3 Google’s Conceptual Captions
We use the same modified ResNet50 architecture as above, and the same hyperparameters as with ImageNet, except that we use the Adam optimizer with a learning rate of 3e-4 and stop training at 120 epochs. We average-pool the BERT sentence embeddings and use them to compute the distance matrix for each minibatch. There are unfortunately no standard metrics to measure retrieval performance when no class labels are given. We therefore resort to reporting the mAHP and the Kendall Tau distance, using Manhattan distances to compare the ranking of one million results retrieved per query with the codes vs. with the sentence embeddings. The Kendall Tau distance has the advantage that it does not depend on the actual distances but only on the ranking (although the ranking is still determined by the sentence embeddings), and that it can be compared to the sentence embedding Kendall Tau distance, which is 0.395 for the BERT embeddings. We compare a SIM-KL model with a "REG" model trained by regressing to the embeddings, and again observe the benefits of the KL loss. We also report the binary entropy, and again observe significantly lower values for the REG model.
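The Kendall Tau comparison used here can be sketched as the standard rank correlation over all pairs (an O(n²) stdlib version for clarity; the full evaluation above ranks one million retrieved results per query):

```python
def kendall_tau(x, y):
    """Kendall rank correlation between two equal-length score lists:
    (concordant pairs - discordant pairs) / total pairs.
    +1 means identical ordering, -1 means fully reversed ordering."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Because only pair orderings enter the computation, it compares the code-based ranking against the embedding-based ranking without depending on the absolute distance values.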
4.4 Zero shot hashing results
We also test our method on Zero Shot Hashing, following the procedure in ; see Table 5. Our models are trained with (i) the ILSVRC2012 1000-class dataset or (ii) the CC dataset, and tested on completely unseen classes by retrieving from the full 14M-image ImageNet dataset with the ILSVRC2012 labels removed. We see that our models perform very well in a completely out-of-sample retrieval problem, and in particular perform much better than the classification-only baseline.
| Flat hit @K | K=1 | K=2 | K=5 | K=10 | K=20 |
|---|---|---|---|---|---|
| EXEM (1NNs)  | 1.8 | 2.9 | 5.3 | 8.2 | 12.2 |
| SIM-KL-CLASS (Ours, trained on ILSVRC2012) | 9.9 | 16.0 | 24.6 | 31.7 | 38.9 |
| CLASS (Ours, trained on ILSVRC2012) | 15.3 | 20.5 | 28.8 | 34.2 | 41.4 |
| SIM-KL-REG (Ours, trained on CC dataset) | 11.1 | 13.2 | 16.8 | 20.3 | 24.5 |
4.5 Example retrieval results
5 Conclusions

We presented a novel method to learn binary codes representing semantic image similarity by defining a distance matrix per minibatch of samples and training the network to match these distances. We also showed that by using the empirical KL loss, information loss can be minimized when the continuous-valued codes are quantized. This leads to virtually no decrease in retrieval performance when using hashing-based retrieval methods, making the codes suitable for efficient semantic retrieval from massive databases; without this loss, performance can degrade significantly. We also showed how modern language models can be used to extract semantically meaningful caption embeddings, which can be used in the semantic learning-to-hash scheme. Another interesting result is that such language models yield higher-quality learned embeddings even on class-based data such as ImageNet than the native WordNet-based WUP measure. We have presented the Kendall Tau metric as one suggested metric when no class labels are available. Finally, we have demonstrated real-world retrieval performance on unseen classes, and the learning of detailed notions of semantic similarity beyond class labels.
Appendix 0.A Appendix
0.a.1 Selection criteria for caption embeddings
We use the STSbenchmark sentence similarity dataset  to select the embeddings best suited for our purposes, i.e. embeddings that capture the semantic similarity of captions as accurately as possible. Each sentence pair in the dataset is scored from 0 to 5 according to semantic similarity. We extract average-pooled embeddings from various Transformer-based sentence encoders in the huggingface repository  for each sentence pair, and compare the Spearman correlation with the ground truth similarity. We also sort the sentence pairs by the embedding-based predicted similarity, and compare the rankings by Kendall Tau distance. Both scoring methods imply that the BERT model  fine-tuned with the MRPC dataset  is best at capturing sentence semantic similarity, although it is possible that methods other than simple average pooling could yield even better results (some models, such as RoBERTa, yielded suspiciously poor results). Further details of these scores are shown in the supplementary material.
-  (2018-09) Hierarchy-based Image Embeddings for Semantic Image Retrieval. External Links: Cited by: §1, §2, §3.2, §4.1, Table 2, Table 3.
-  (2009) Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc. Cited by: §4.
-  (2017-02) HashNet: Deep Learning to Hash by Continuation. External Links: Cited by: §1.
-  (2017) Semeval-2017 task 1: semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055. Cited by: §0.A.1, §1, §4.2, §4.
-  (2016) Synthesized classifiers for zero-shot learning. In , pp. 5327–5336. Cited by: Table 5.
-  (2017) Predicting visual exemplars of unseen classes for zero-shot learning. In Proceedings of the IEEE international conference on computer vision, pp. 3476–3485. Cited by: Table 5.
-  (2018-04) Frustratingly Easy Meta-Embedding – Computing Meta-Embeddings by Averaging Source Word Embeddings. External Links: Cited by: §2.
-  (2011) Hierarchical semantic indexing for large scale image retrieval.. CVPR, pp. 785–792. Cited by: item 3.
-  (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §0.A.1, §1, §4.2, §4.
-  (2018-11) Mean Local Group Average Precision (mLGAP): A New Performance Metric for Hashing-based Retrieval. External Links: Cited by: §3.2.
-  (2004) Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In Proceedings of the 20th international conference on Computational Linguistics, pp. 350. Cited by: §0.A.1.
-  (1998) WordNet: an electronic lexical database. Bradford Books. Cited by: §1, §4.
-  (2013) DeViSE - A Deep Visual-Semantic Embedding Model.. NIPS. Cited by: §2, Table 2.
-  (1999) Similarity Search in High Dimensions via Hashing.. VLDB. Note: LSH Cited by: §1.
-  (2015-10) A Primer on Neural Network Models for Natural Language Processing. External Links: Cited by: §2.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.2.
-  (2018) Hashing as tie-aware learning to rank. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4023–4032. Cited by: §3.
Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 558–567. Cited by: §4.2.
-  (2017-07) Asymmetric Deep Supervised Hashing. External Links: Cited by: §1.
-  (2018-03) Unsupervised Semantic Deep Hashing. External Links: Cited by: §1, §2.
-  (1938) A new measure of rank correlation. Biometrika 30 (1/2), pp. 81–93. Cited by: §4.3.
-  (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §1.
-  (2018-01) Dual Asymmetric Deep Hashing Learning. External Links: Cited by: §1.
-  (2015) Deep hashing for compact binary codes learning.. CVPR, pp. 2475–2483. Cited by: §1.
-  (2009) Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. Pearson/Prentice Hall Upper Saddle River. Cited by: §4.
-  (2013) Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650. Cited by: Table 5.
-  (2013) Fast exact search in hamming space with multi-index hashing. IEEE transactions on pattern analysis and machine intelligence 36 (6), pp. 1107–1119. Cited by: item 2, §1.
-  (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §4.
-  (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Cited by: §1.
-  (2016-09) How should we evaluate supervised hashing?. External Links: Cited by: §1, §4.2.
-  (2019) Spreading vectors for similarity search. In International Conference on Learning Representations, External Links: Cited by: §2.
Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2556–2565. External Links: Cited by: §1.
-  (2019) Scalable zero-shot learning via binary visual-semantic embeddings. IEEE Transactions on Image Processing 28 (7), pp. 3662–3674. Cited by: item 3, §4.4, Table 5.
-  (2017-10) Label Embedding Network: Learning Label Representation for Soft Training of Deep Networks. External Links: Cited by: Table 2, Table 3.
-  (2016-06) A Survey on Learning to Hash. External Links: Cited by: §1.
-  (2006) A nearest-neighbor approach to estimating divergence between continuous random vectors. In 2006 IEEE International Symposium on Information Theory, pp. 242–246. Cited by: item 2, §3.3, §4.2.
A Discriminative Feature Learning Approach for Deep Face Recognition.. ECCV 9911 (12), pp. 499–515. Cited by: Table 2, Table 3.
-  (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: §0.A.1.
-  (2015-08) Learning Meta-Embeddings by Using Ensembles of Embedding Sets. External Links: Cited by: §2.
-  (2018-03) Deep Class-Wise Hashing: Semantics-Preserving Hashing via Class-wise Loss. External Links: Cited by: §2.
-  (2019-01) Semantic Hierarchy Preserving Deep Hashing for Large-scale Image Retrieval. External Links: Cited by: §2, Table 2.