Hierarchy-based Image Embeddings for Semantic Image Retrieval

Deep neural networks trained for classification have been found to learn powerful image representations, which are also often used for other tasks such as comparing images w.r.t. their visual similarity. However, visual similarity does not imply semantic similarity. In order to learn semantically discriminative features, we propose to map images onto class centroids whose pair-wise dot products correspond to a measure of semantic similarity between classes. Such an embedding would not only improve image retrieval results, but could also facilitate integrating semantics for other tasks, e.g., novelty detection or few-shot learning. We introduce a deterministic algorithm for computing the class centroids directly based on prior world-knowledge encoded in a hierarchy of classes such as WordNet. Experiments on CIFAR-100 and ImageNet show that our learned semantic image embeddings improve the semantic consistency of image retrieval results by a large margin.




1 Introduction

Figure 1: Comparison of image retrieval results on CIFAR-100 [18] for 3 exemplary queries, using features extracted from a variant of ResNet-110 [13] trained for classification and semantic embeddings learned by our method. The border colors of the retrieved images correspond to the semantic similarity between their class and the class of the query image (dark green: most similar; dark red: most dissimilar). It can be seen that hierarchy-based semantic image embeddings lead to much more semantically consistent retrieval results.

During the past few years, deep convolutional neural networks (CNNs) have continuously advanced the state of the art in image classification [19, 30, 13, 15] and many other tasks. The intermediate image representations learned by CNNs trained for classification have also proven to be powerful image descriptors for retrieving images from a database that are visually or semantically similar to one or more query images given by the user [2, 29, 1]. This task is called content-based image retrieval (CBIR) [31].

Usually, the categorical cross-entropy loss after a softmax activation is used for training CNN classifiers, which results in well-separable features. However, these features are not necessarily discriminative, i.e., the inter-class variance of the learned representations may be small compared to the intra-class variance [35]. The average distance between the classes "cat" and "dog" in feature space may well be as large as the distance between "cat" and "forest". For image retrieval, this situation is far from optimal, since the nearest neighbors of a certain image in feature space may belong to completely different classes. This can, for instance, be observed in the upper row of each example in Fig. 1, where we used features extracted from the global average pooling layer of ResNet-110 [13] for image retrieval on the CIFAR-100 dataset [18]. While the first results often belong to the same class as the query, the semantic relatedness of the results at later positions deteriorates significantly. Those results seem to be mainly visually similar to the query with respect to shape and color, but not semantically.

Many authors have therefore proposed a variety of metric learning losses, aiming to increase the separation of classes, while minimizing the distances between samples from the same class. Popular instances of this branch of research are the contrastive loss [6], the triplet loss [27], and the quadruplet loss [5]. However, these methods require sampling of hard pairs, triplets, or even quadruplets of images, making training cumbersome and expensive. On the other hand, they still do not impose any constraints on inter-class relationships: Though images of the same class should be tightly clustered together in feature space, neighboring clusters may still be completely unrelated.

This harms the semantic consistency of image retrieval results. Consider, for example, a query image showing a poodle. It is unclear whether the user is searching for images of poodles only, for images of any other dog breed, or even for images of animals in general. In the ideal case, current CBIR methods would first retrieve all images of the same class as the query, but then continue with mostly unrelated, only visually similar images from other classes.

In this work, we propose a method for learning semantically meaningful image representations, so that the Euclidean distance in the feature space directly corresponds to the semantic dissimilarity of the classes. Since "semantic similarity" is, in principle, arbitrary, we rely on prior knowledge about the physical world encoded in a hierarchy of classes. Such hierarchies are readily available in many domains, since various scientific disciplines have been striving towards organizing the entities of the world in ontologies for years. WordNet [10], for example, is well-known for its good coverage of the world with over 80,000 semantic concepts, and the Wikispecies project (https://species.wikimedia.org/) provides a taxonomy of living things comprising more than half a million nodes.

We make use of the knowledge explicitly encoded in such hierarchies to derive a measure of semantic similarity between classes and introduce an algorithm for explicitly computing target locations for all classes on a unit hypersphere, so that the pair-wise dot products of their embeddings equal their semantic similarities. We then learn a transformation from the space of color images to this semantically meaningful feature space, so that the correlation between image features and their class-specific target embedding is maximized. This can easily be done using the negated dot product as a loss function, without having to consider pairs or triplets of images or hard-negative mining.

Exemplary retrieval results of our system are shown in the bottom rows of each example in Fig. 1. It can be seen that our semantic class embeddings lead to image features that are much more invariant against superficial visual differences: A green apple is semantically more similar to oranges than orange tulips and an oak is more similar to a palm tree than a spider. The results follow the desired scheme described above: For a query image showing an orange, all oranges are retrieved first, then all apples, pears, and other fruits. As a last example, incorporating semantic information about classes successfully helps avoiding the not so uncommon mistake of confusing humans with apes.

In the following, we first discuss related work on learning semantic image representations in Section 2. Our approach for computing class embeddings based on a class hierarchy is presented in Section 3, and Section 4 explains how we learn to map images onto those embeddings. Experiments on CIFAR-100 [18] and ILSVRC 2012 [26] are presented in Section 5, and Section 6 concludes this work.

Our source code and pre-trained models will be released upon formal publication of this paper.

2 Related Work

Since the release of the ImageNet dataset [8] in 2009, many authors have proposed to leverage the WordNet ontology, from which the classes of ImageNet have been derived, for improving classification and image retrieval: The creators of ImageNet, Deng et al. [7], derived a deterministic bilinear similarity measure from the taxonomy of classes for comparing image feature vectors composed of class-wise probability predictions. This way, images assigned to different classes can still be considered similar if the two classes are similar to each other.

Regarding classification, Zhao et al. [37] modify multinomial logistic regression to take the class structure into account, using the dissimilarity of classes as misclassification cost. Verma et al. [33], on the other hand, learn a specialized Mahalanobis distance metric for each node in a class hierarchy and combine them along the paths from the root to the leaves. This results in different metrics for all classes and only allows for nearest-neighbor-based classification methods. In contrast, Chang et al. [4] learn a global Mahalanobis metric on the space of class-wise probability predictions, where they enforce margins between classes proportional to their semantic dissimilarities. HD-CNN [36] follows an end-to-end learning approach by dividing a CNN into a coarse and several fine classifiers based on two levels of a class hierarchy, fusing predictions at the end.

All these approaches exploit the class structure for classification or retrieval, but only at the classifier level instead of the features themselves. Our approach, in contrast, embeds images into a semantically meaningful space where the dot product corresponds to the similarity of classes. This not only makes semantic image retrieval straightforward, but also enables the application of a wide variety of existing methods that rely on metric feature spaces, e.g., clustering or the integration of relevance feedback into the retrieval [9].

A very similar approach has been taken by Weinberger et al. [34], who propose "taxonomy embeddings" ("taxem") for the categorization of documents. However, they do not specify how they obtain semantic class similarities from the taxonomy. Moreover, they learn a linear transformation from hand-crafted document features onto the class embedding space using ridge regression, whereas we perform end-to-end learning using neural networks.

More recently, several authors proposed to jointly learn embeddings of classes and images based solely on visual information, e.g., using the "center loss" [35] or a "label embedding network" [32]. However, semantics are often too complex to be derived from visual information only. For example, the label embedding network [32] learned that pears are similar to bottles, because their shape and color are often similar and the image information alone is not sufficient for learning that fruits and man-made containers are fundamentally different concepts.

To avoid such issues, Frome et al. (“DeViSE” [11]) and Li et al. [20] propose to incorporate prior world-knowledge by mapping images onto word embeddings of class labels learned from text corpora [22, 24]. To this end, Frome et al. [11] need to pre-train their image embeddings for classification initially, and Li et al. [20] first perform region mining and then use three sophisticated loss functions, requiring the mining of either hard pairs or triplets.

In contrast to this expensive training procedure relying on the additional input of huge text corpora, we show how to explicitly construct class embeddings based on prior knowledge encoded in an easily obtainable hierarchy of classes, without the need to learn such embeddings approximately. These embeddings also allow for straightforward learning of image representations by simply maximizing the dot product of image and class embeddings.

A broader overview of research aiming to incorporate prior knowledge into deep learning of image representations can be found in the literature.


3 Hierarchy-based Class Embeddings

In the following, we first describe how we measure semantic similarity between classes based on a hierarchy and then introduce our method for explicitly computing class embeddings based on those pair-wise class similarities.

3.1 Measuring Semantic Similarity

Let G = (V, E) be a directed acyclic graph with nodes V and edges E ⊆ V × V, specifying the hyponymy relation between semantic concepts. In other words, an edge (u, v) ∈ E means that v is a sub-class of u. The actual classes of interest are a subset C = {c_1, …, c_n} ⊆ V of the semantic concepts. An example for such a graph, with the special property of being a tree, is given in Fig. 2(a).

A commonly used measure for the dissimilarity of classes organized in this way is the height of the sub-tree rooted at the lowest common subsumer (LCS) of two classes, divided by the height of the hierarchy [7, 33]:

    d_G(u, v) = height(lcs(u, v)) / max_{w ∈ V} height(w),    (1)

where the height of a node is defined as the length of the longest path from that node to a leaf. The LCS of two nodes is the ancestor of both nodes that does not have any child being an ancestor of both nodes as well.

Since d_G is bounded between 0 and 1, we can easily derive a measure for semantic similarity between semantic concepts as well:

    s_G(u, v) = 1 − d_G(u, v).    (2)
For example, the toy hierarchy in Fig. 2(a) has a total height of 3, the LCS of the classes "dog" and "cat" is "mammal", and the LCS of "dog" and "trout" is "animal". It follows that s_G(dog, cat) = 1 − 1/3 = 2/3 and s_G(dog, trout) = 1 − 2/3 = 1/3.
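The definitions above can be made concrete with a small, self-contained Python sketch. The tree below loosely follows the toy hierarchy of Fig. 2(a); the node names and the extra root node are illustrative, not taken verbatim from the paper.

```python
# Toy hierarchy: child -> parent (illustrative node names).
parents = {
    "animal": "entity",
    "mammal": "animal", "fish": "animal",
    "dog": "mammal", "cat": "mammal", "trout": "fish",
}

# Invert the map to parent -> children for height computation.
children = {}
for child, parent in parents.items():
    children.setdefault(parent, []).append(child)

def ancestors(node):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path

def height(node):
    """Length of the longest path from the node down to a leaf."""
    if node not in children:
        return 0
    return 1 + max(height(c) for c in children[node])

def lcs(u, v):
    """Lowest common subsumer: the deepest ancestor shared by u and v."""
    shared = set(ancestors(v))
    return next(a for a in ancestors(u) if a in shared)

# Height of the hierarchy = height of the highest node.
MAX_HEIGHT = max(height(w) for w in set(parents) | set(children))

def d(u, v):
    """Dissimilarity, Eq. (1)."""
    return height(lcs(u, v)) / MAX_HEIGHT

def s(u, v):
    """Similarity, Eq. (2)."""
    return 1.0 - d(u, v)
```

With this toy tree, `s("dog", "cat")` evaluates to 2/3 and `s("dog", "trout")` to 1/3, matching the worked example above.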

Note that though d_G is symmetric and non-negative, it is only guaranteed to be a proper metric if G is a tree with all classes of interest being leaf nodes (the proof can be found in Appendix A). Some well-known ontologies such as WordNet [10] do not have this property and violate the triangle inequality. For instance, the WordNet synset "golfcart" is a hyponym of both "vehicle" and "golf equipment". It is hence similar to cars and golf balls, while both are not similar to each other at all.

Our goal is to embed the classes onto a unit hypersphere so that their dot product corresponds to s_G. Thus, the Euclidean distance between the embeddings of two classes u and v equals √(2 · d_G(u, v)), which hence has to be a metric. Therefore, we assume the hierarchy to be given as a tree in the following. In case this assumption does not hold, approaches from the literature for deriving tree-shaped hierarchies from ontologies such as WordNet can be employed. For instance, YOLO-9000 [25] starts with a tree consisting of the root-paths of all concepts which have only one such path and then successively adds paths to other concepts that result in the least number of nodes added to the existing tree.

Figure 2: (a) A toy hierarchy and (b) an embedding of 3 classes from this hierarchy with their pair-wise similarities s_G.

3.2 Class Embedding Algorithm

Consider n classes C = {c_1, …, c_n} embedded in a hierarchy as above. Our aim is to compute embeddings φ(c_i) ∈ R^n, 1 ≤ i ≤ n, so that

    φ(c_i)^T φ(c_j) = s_G(c_i, c_j),    1 ≤ i, j ≤ n,    (3)
    ‖φ(c_i)‖ = 1,    1 ≤ i ≤ n.    (4)

In other words, the correlation of class embeddings should equal the semantic similarity of the classes and all embeddings should be L2-normalized. Eq. (4) actually is a direct consequence of (3) in combination with the fact that s_G(c, c) = 1 for any class c, but we formulate it as an explicit constraint here for clarity, emphasizing that all class embeddings lie on a unit hypersphere. This does not only allow us to use the negated dot product as a substitute for the Euclidean distance, but also accounts for the fact that L2 normalization has proven beneficial for CBIR in general, because the direction of high-dimensional feature vectors often carries more information than their magnitude [17, 16, 14].

Require: hierarchy G, set of classes C = {c_1, …, c_n}
Ensure: embeddings φ(c_i) ∈ R^n for 1 ≤ i ≤ n
1:  φ(c_1) ← (1, 0, …, 0)^T
2:  for i = 2 to n do
3:      (φ(c_i)_1, …, φ(c_i)_{i−1}) ← solution of Eq. (5)
4:      φ(c_i)_i ← maximum of the solutions of Eq. (6)
5:      (φ(c_i)_{i+1}, …, φ(c_i)_n) ← 0
6:  end for
Algorithm 1: Hierarchy-based Class Embeddings

We follow a step-wise approach for computing the embeddings φ(c_i), as outlined in Algorithm 1: We can always choose an arbitrary point on the unit hypersphere as embedding for the first class c_1, because the constraint (3) is invariant against arbitrary rotations. Here, we choose φ(c_1) = (1, 0, …, 0)^T, which will simplify further calculations.

The remaining classes c_i, 2 ≤ i ≤ n, are then placed successively so that each new embedding has the correct dot product with every already embedded class:

    φ(c_k)^T φ(c_i) = s_G(c_k, c_i),    1 ≤ k < i.    (5)

This is a system of i − 1 linear equations, where φ(c_i) is the vector of unknown variables. As per our construction, only the first k dimensions of each φ(c_k), 1 ≤ k < i, are non-zero, so that the effective number of free variables is i − 1. The system is hence well-determined and in lower-triangular form, so that it has a unique solution that can be computed efficiently with O(i²) floating-point operations using forward substitution.

This solution for the first i − 1 coordinates of φ(c_i) already fulfills (3), but not (4), i.e., it is not L2-normalized. Thus, we use an additional dimension to achieve normalization:

    φ(c_i)_i² = 1 − Σ_{k=1}^{i−1} φ(c_i)_k².    (6)

Without loss of generality, we always choose the non-negative solution of this equation, so that all class embeddings lie entirely in the positive orthant of the feature space.

Due to this construction, exactly n feature dimensions are required for computing a hierarchy-based embedding of n classes. An example of such an embedding for 3 classes is given in Fig. 2(b). The overall complexity of Algorithm 1 is O(n³).
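Algorithm 1 can be sketched in a few lines of NumPy. This is a hedged illustration under our own naming, not the authors' released code: given the matrix S of pair-wise class similarities with unit diagonal, each class is placed by forward substitution (Eq. 5) followed by the extra normalization coordinate (Eq. 6).

```python
import numpy as np

def class_embeddings(S):
    """Compute embeddings phi (one row per class) with phi @ phi.T == S
    and unit-norm rows, following the step-wise construction."""
    n = S.shape[0]
    phi = np.zeros((n, n))
    phi[0, 0] = 1.0  # first class: an arbitrary point on the unit sphere
    for i in range(1, n):
        for k in range(i):
            # Eq. (5): phi[k] . phi[i] = S[k, i]; the lower-triangular
            # structure lets us solve coordinate k directly
            # (forward substitution).
            phi[i, k] = (S[k, i] - phi[k, :k] @ phi[i, :k]) / phi[k, k]
        # Eq. (6): one extra coordinate restores the unit norm; we take
        # the non-negative root, as in the text.
        phi[i, i] = np.sqrt(max(0.0, 1.0 - phi[i, :i] @ phi[i, :i]))
    return phi
```

Viewed as a whole, this procedure computes a Cholesky-type factorization of S (Φ is lower triangular with non-negative diagonal), which also explains why S needs to be positive semi-definite for the construction to succeed.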

3.3 Low-dimensional Approximation

In settings with several thousands of classes, the number of features required by our algorithm from Section 3.2 might become infeasible. However, it is possible to obtain class embeddings of arbitrary dimensionality whose pair-wise dot products best approximate the class similarities.

Let S ∈ R^{n×n} be the matrix of pair-wise class similarities, S_{ij} = s_G(c_i, c_j), and Φ ∈ R^{n×n} be the matrix whose i-th row is the embedding for class c_i. Then, we can reformulate (3) as

    Φ · Φ^T = S.    (7)

Since S is symmetric, it has an eigendecomposition S = Q · Λ · Q^T, where Q is an orthogonal matrix whose columns contain the eigenvectors of S and Λ is a diagonal matrix containing the corresponding eigenvalues.

Thus, we could also use the eigendecomposition of S to obtain the class embeddings as Φ = Q · Λ^{1/2}. However, we have found our algorithm presented in Section 3.2 to provide better numerical accuracy, resulting in a considerably smaller maximum error of pair-wise distances for the 1,000 classes of ILSVRC [26] than the eigendecomposition. Moreover, the eigendecomposition does not guarantee all class embeddings to lie in the positive orthant of the feature space. However, since most modern neural network architectures use ReLU activation functions, we have found the strictly positive class embeddings computed by our algorithm to result in slightly better performance.

On the other hand, when dealing with a large number of classes, the eigendecomposition can be useful to obtain a low-dimensional embedding that does not reconstruct S exactly, but approximates it as well as possible, by keeping only the eigenvectors corresponding to the largest eigenvalues. However, the resulting class embeddings will not be L2-normalized any more. Experiments in Appendix D show that our method can still provide superior retrieval performance even with very low-dimensional embeddings, which is also advantageous regarding memory consumption when dealing with large datasets.
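The low-dimensional approximation just described can be sketched as follows (again a hedged NumPy illustration with our own function name): keep only the eigenvectors belonging to the largest eigenvalues of S and scale them by the square roots of those eigenvalues.

```python
import numpy as np

def low_dim_embeddings(S, dim):
    """Rows approximate the class embeddings; exact when dim equals
    the number of classes and S is positive semi-definite."""
    vals, vecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]    # indices of the largest eigenvalues
    vals = np.clip(vals[idx], 0.0, None)  # guard against tiny negative values
    return vecs[:, idx] * np.sqrt(vals)   # Q_dim * Lambda_dim^{1/2}
```

Truncating to `dim < n` trades reconstruction accuracy of S for memory, and, as noted above, the truncated rows are no longer exactly L2-normalized.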

4 Mapping Images onto Class Centroids

Knowing the target embeddings φ(c) for all classes, we need to find a transformation ψ from the space of images into the hierarchy-based semantic embedding space, so that image features are close to the centroid of their class.

4.1 Loss Function

For modeling ψ, we employ convolutional neural networks (CNNs) whose last layer has n output channels and no activation function. Since we embed all classes onto the unit hypersphere, the last layer is followed by L2 normalization of the feature vectors. The network is trained on batches of images x with class labels y ∈ C, under the supervision of a simple loss function that enforces similarity between the learned image representations and the semantic embedding of their class:

    L_CORR(x, y) = −φ(y)^T ψ(x).    (8)
Note that the class centroids are computed beforehand using the algorithm described in Section 3 and fixed during the training of the network. This allows our loss function to be used stand-alone, as opposed to, for example, the center loss [35], which requires additional supervision. Moreover, the loss caused by each sample is independent from all other samples, so that expensive mining of hard pairs [6] or triplets [27] of samples is not necessary.

Features learned this way may not only be used for retrieval but also for classification by assigning a sample to the class whose embedding is closest in the feature space. However, one might as well add a fully-connected layer with softmax activation on top of the embedding layer, producing class probabilities g(ψ(x)). The network could then simultaneously be trained for computing semantic image embeddings and classifying images using a combined loss

    L_CORR+CLS(x, y) = L_CORR(x, y) + λ · L_CLS(g(ψ(x)), y),    (9)

where L_CLS denotes the categorical cross-entropy loss function

    L_CLS(p, y) = −log(p_y).    (10)

Since we would like the embedding loss to dominate the learning of image representations, we set λ to a small value in our experiments.
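The two loss terms can be illustrated framework-agnostically. The NumPy sketch below uses our own function names; the correlation term is shifted by a constant so that its optimum is 0 (the gradients are identical to Eq. 8), and λ = 0.1 is an illustrative default, since the text only states that λ is small.

```python
import numpy as np

def corr_loss(x, centroid):
    """Eq. (8)-style loss: negated dot product between the L2-normalized
    image embedding and its fixed class centroid, shifted so that a
    perfectly aligned embedding yields loss 0."""
    x = x / np.linalg.norm(x)  # embeddings live on the unit hypersphere
    return 1.0 - float(x @ centroid)

def combined_loss(x, logits, class_idx, centroid, lam=0.1):
    """Eq. (9)-style loss: correlation term plus a small, scaled
    categorical cross-entropy on a separate classification head."""
    p = np.exp(logits - np.max(logits))  # numerically stable softmax
    p /= p.sum()
    return corr_loss(x, centroid) + lam * (-np.log(p[class_idx]))
```

Note that each sample's loss depends only on its own embedding and fixed class centroid, which is why no pair or triplet mining is required.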

5 Experiments

In the following, we present results on three different datasets and compare our method for learning semantic image representations with features learned by other methods that try to take relationships among classes into account.

5.1 Datasets and Setups

5.1.1 Cifar-100

The extensively benchmarked CIFAR-100 dataset [18] consists of 100 classes with 500 training and 100 test images each. In order to make our approach applicable to this dataset, we created a taxonomy for the set of classes, mainly based on the WordNet ontology [10] but slightly simplified and with a strict tree structure. A visualization of that taxonomy can be found in Appendix E.

We test our approach with 3 different architectures designed for this dataset [18], while, in general, any neural network architecture can be used for mapping images onto the semantic space of class embeddings:

  • Plain-11 [3], a recently proposed strictly sequential, shallow, wide, VGG-like network consisting of only 11 trainable layers, which has been found to achieve classification performance competitive to ResNet-110 when trained using cyclical learning rates.

  • A variant of ResNet-110 [13], a deep residual network with 110 trainable layers. We use 32, 64, and 128 instead of 16, 32, and 64 channels per block, so that the number of features before the final fully-connected layer is greater than the number of classes. To make this difference obvious, we refer to this architecture as “ResNet-110w” in the following (“w” for “wide”).

  • PyramidNet-272-200 [12], a deep residual network whose number of channels increases with every layer and not just after pooling.

Following [3], we train these networks using SGD with warm restarts (SGDR [21]): starting from the base learning rate, the learning rate is smoothly decreased over a cycle of 12 epochs to a small minimum using cosine annealing. The next cycle then begins again with the base learning rate, and the length of the cycles is doubled at the end of each one. All network architectures are trained over a total of 5 cycles (372 epochs) with a fixed batch size. To prevent divergence caused by the initially high learning rate, we employ gradient clipping [23] and restrict the norm of the gradients to a fixed maximum.

5.1.2 North American Birds

The North American Birds (NAB) dataset (http://dl.allaboutbirds.org/nabirds) comprises 23,929 training and 24,633 test images showing birds from 555 different species. A hierarchy of those classes, 5 levels deep, is provided with the dataset.

One of the challenges of this dataset is that most birds only take up a small part of the image, so that they would need to be detected first. However, since we are not interested in beating the state-of-the-art classification accuracy on this dataset but rather want to show the benefits of hierarchy-based semantic image embeddings, we crop all images to the birds using the bounding box annotations provided with the dataset.

We use the ResNet-50 architecture, but do not pre-train it on external data, as is usually done due to the small number of NAB training images. Instead, we train the network from scratch on NAB only. All images are rescaled to a common size and randomly cropped to the network's input resolution. During training, we also apply random horizontal flipping and random erasing [38] for data augmentation.

The network is trained using SGDR as described above for 4 cycles (180 epochs) using 64 samples per batch.

5.1.3 Ilsvrc 2012

We also conduct experiments on data from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, which comprises over 1.1 million training and 50,000 test images from 1,000 classes. The classes originate from the WordNet ontology. Since this taxonomy is not a tree, we used the method of Redmon & Farhadi [25] described in Section 3.1 for deriving a tree-shaped hierarchy from WordNet. For the evaluation, however, we used the full WordNet taxonomy for computing class similarities.

We employ the ResNet-50 architecture [13] with a batch size of 128 images but only use a single cycle of SGDR, i.e., just the cosine annealing without warm restarts, smoothly decreasing the learning rate from its base value to a small minimum over 80 epochs. We used the same data augmentation techniques as He et al. [13] (flipping, scale augmentation, random cropping), except color augmentation, and use the weights provided by them as the classification-based baseline to compare our method with.

Figure 3: Hierarchical precision on CIFAR-100 at several cutoff points k for various network architectures.

5.2 Performance Metrics

Image retrieval tasks are often evaluated using the precision of the top k results (P@k) and mean average precision (mAP). However, these metrics do not take similarities among classes and varying misclassification costs into account. Thus, we introduce variants of them that are aware of the semantic relationships among classes.

Let q be a query image belonging to class y_q, and let R = ((r_1, y_1), …, (r_m, y_m)) denote an ordered list of retrieved images r_i and their associated classes y_i. Following Deng et al. [7], we define the hierarchical precision at k (HP@k) for k ≤ m as

    HP@k = Σ_{i=1}^{k} s_G(y_q, y_i)  /  max_π Σ_{i=1}^{k} s_G(y_q, y_{π(i)}),    (11)

where π denotes any permutation of the indices from 1 to m. Thus, the denominator in Eq. (11) normalizes the hierarchical precision through division by the similarity sum of the best possible ranking.

For several values of k, HP@k can be plotted as a curve, giving a more detailed insight into the behavior of a retrieval system regarding the top few results (short-term performance) and the overall ranking (long-term performance). We denote the area under that curve from k = 1 to K as average hierarchical precision at K (AHP@K). It can be used to compare the semantic performance of image retrieval systems by means of a single number. In the following, we always report the AHP@250, because we do not expect the typical user to inspect more than 250 results.
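In code, HP@k boils down to comparing the similarity sum of the actual top-k ranking with that of the best achievable permutation of the same retrieval list. The helper below is a plain-Python sketch under our own naming, where `sim` stands for any pair-wise class similarity such as s_G.

```python
def hierarchical_precision_at_k(query_class, retrieved_classes, sim, k):
    """HP@k following Eq. (11): summed class similarities of the top-k
    results, normalized by the similarity sum of the best possible
    ranking of the retrieved list."""
    achieved = sum(sim(query_class, c) for c in retrieved_classes[:k])
    best = sorted((sim(query_class, c) for c in retrieved_classes),
                  reverse=True)
    return achieved / sum(best[:k])
```

A ranking that places the most similar classes first therefore scores 1.0, regardless of how dissimilar the later retrievable classes are.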

5.3 Competitors

We evaluate two variants of our hierarchy-based semantic image embeddings: trained with L_CORR only (Eq. 8), or trained with the combination of correlation and categorical cross-entropy loss, L_CORR+CLS, according to Eq. 9.

As a baseline, we compare them with features extracted from the last layer right before the final classification layer of the same network architecture, but trained solely for classification. We also evaluate the performance of L2-normalized versions of these classification-based features, which are usually used for CBIR (Section 1).

Moreover, we compare with DeViSE [11], the center loss [35], and label embedding networks [32]. All methods have been applied to identical network architectures and trained with exactly the same optimization strategy, except DeViSE, which requires special training.

Since DeViSE [11] learns to map images onto word embeddings learned from text corpora, we were only able to apply it on CIFAR-100, since about one third of the 1,000 ILSVRC classes could not be matched to word embeddings automatically and the scientific bird names of the NAB dataset are not part of the vocabulary either. For CIFAR-100, however, we used pre-computed GloVe word embeddings [24] learned from Wikipedia. In order to be comparable to the 100-dimensional image embeddings learned by our method (equal to the number of classes), we have used GloVe embeddings with 100 dimensions as well.

5.4 Semantic Image Retrieval Performance

                                         CIFAR-100                          NAB     ILSVRC
                                         Plain-11  ResNet-110w  PyramidNet
Classification-based                     0.5458    0.7261       0.6775      0.3625  0.6831
Classification-based + L2 Norm           0.5980    0.7468       0.7334      0.3786  0.7132
DeViSE [11]                              0.6267    0.7348       0.7116      —       —
Center Loss [35]                         0.6762    0.6815       0.6227      0.4275  0.4094
Label Embedding [32]                     0.6825    0.7950       0.7888      0.5197  0.4769
Semantic Embeddings (L_CORR) [ours]      0.8183    0.8290       0.8653      0.7885  0.7902
Semantic Embeddings (L_CORR+CLS) [ours]  0.8309    0.8335       0.8638      0.8116  0.8242
Table 1: Retrieval performance of different image features in mAHP@250. The best value per column is set in bold font.

For all datasets, we used each of the test images as individual query, aiming to retrieve semantically similar images from the remaining ones. Retrieval is performed by ranking the images in the database decreasingly according to their dot product with the query image in the feature space.
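The retrieval step described above reduces to a dot-product ranking; a toy NumPy sketch (with made-up feature vectors, not data from our experiments) looks as follows.

```python
import numpy as np

def retrieve(query_feat, db_feats):
    """Return database indices sorted by decreasing dot product with
    the query embedding (best match first)."""
    scores = db_feats @ query_feat  # one dot product per database image
    return np.argsort(-scores)      # negate for descending order
```

Because all embeddings are L2-normalized, ranking by dot product is equivalent to ranking by increasing Euclidean distance.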

The mean values of HP@k on CIFAR-100 over all queries at the first 250 values of k are reported in Fig. 3. It can clearly be seen that our hierarchy-based semantic embeddings achieve a much higher hierarchical precision than all competing methods. While the differences are rather small when considering only very few top-scoring results, our method maintains a much higher precision when taking more retrieved images into account.

There is an interesting turning point in the precision curves at k = 100, where only our method suddenly starts increasing in hierarchical precision. This is because there are exactly 100 test images per class, so that retrieving only images from exactly the same category as the query is no longer sufficient after this point. Instead, a semantically-aware retrieval method should continue retrieving images from the most similar classes at the later positions in the ranking, of which only our method is capable.

While additionally taking a classification objective into account during training improves the semantic retrieval performance on CIFAR-100 only slightly compared to using our simple loss function alone, it is clearly beneficial on NAB and ILSVRC. This can be seen from the mAHP@250 reported for all datasets in Table 1. Our approach outperforms the second-best method for every dataset and network architecture, with relative improvements of mAHP@250 between 5% and 22% on CIFAR-100, 56% on NAB, and 16% on ILSVRC.

We also found that our approach outperforms all competitors in terms of classical mAP, which does not take class similarities into account. The detailed results as well as qualitative examples for ILSVRC are provided in Appendices C and B. Meanwhile, Fig. 1 shows some qualitative results on CIFAR-100.

                                         CIFAR-100                          NAB     ILSVRC
                                         Plain-11  ResNet-110w  PyramidNet
Classification-based                     73.73%    76.95%       81.44%      53.25%  73.87%
DeViSE [11]                              69.75%    74.66%       77.32%      —       —
Center Loss [35]                         75.06%    75.18%       76.83%      59.28%  70.05%
Label Embedding [32]                     75.57%    76.96%       79.35%      67.61%  70.94%
Semantic Embeddings (L_CORR) [ours]      73.83%    75.03%       79.87%      64.71%  48.97%
Semantic Embeddings (L_CORR+CLS) [ours]  75.31%    76.24%       80.49%      71.80%  69.18%
Table 2: Classification accuracy of various methods. The best value in every column is set in bold font.

5.5 Classification Performance

While our method achieves superior performance in the scenario of content-based image retrieval, we also wanted to make sure that it does not sacrifice classification performance. Though classification was not the objective of our work, we have hence also compared the classification accuracy obtained by all tested methods on all datasets and show the results in Table 2. It can be seen that learning to map images onto the semantic space of class embeddings does not unreasonably impair classification performance.

For DeViSE, which does not produce any class predictions, we trained a linear SVM on top of the extracted features. The same issue arises for hierarchy-based semantic embeddings trained with the embedding loss alone, where we performed classification by assigning images to the class whose embedding has the largest dot product with the image feature vector. Though doing so gives fair results, the accuracy becomes more competitive when training with the combined objective instead, i.e., a combination of the embedding and the classification loss.
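The dot-product classification rule just described can be sketched as follows; the function name and array layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def classify_by_embedding(feature, class_embeddings):
    """Assign an image to the class whose embedding has the largest
    dot product with the image feature vector.

    feature: array of shape (d,); class_embeddings: array of shape
    (n_classes, d), one row per class centroid. Returns the index of
    the predicted class."""
    scores = class_embeddings @ feature  # dot product with every centroid
    return int(np.argmax(scores))
```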

In the case of the NAB dataset, our method even obtains the best classification accuracy. Incorporating semantic information about class similarities obviously facilitates training in this case, but further investigation is needed to determine whether this is due to the fine-grained nature of the task or the limited amount of training data.

6 Conclusions and Future Work

We proposed a novel method for integrating basic prior knowledge about the semantic relationships between classes, given as a class taxonomy, into deep learning. Our hierarchy-based semantic embeddings preserve the semantic similarity of the classes in the joint space of image and class embeddings and thus allow for retrieving images from a database that are not only visually, but also semantically similar to a given query image. This avoids unrelated matches and improves the quality of content-based image retrieval results significantly compared with other recent representation learning methods.

In contrast to other commonly used class representations such as text-based word embeddings, the hierarchy-based embedding space constructed by our method allows for straightforward training of neural networks by learning a regression of the class centroids using a simple loss function involving only a dot product.
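A minimal sketch of such a dot-product regression loss, assuming L2-normalized image features and class centroids (the names are ours, not the paper's code): the loss vanishes exactly when the image is mapped onto its class centroid.

```python
import numpy as np

def embedding_loss(image_feature, class_centroid):
    """1 - <f(x), phi(y)> for unit-normalized vectors: 0 when the image
    feature coincides with the centroid of its class, 2 in the worst case
    (antipodal vectors)."""
    f = image_feature / np.linalg.norm(image_feature)
    phi = class_centroid / np.linalg.norm(class_centroid)
    return 1.0 - float(f @ phi)
```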

The learned image features have also proven to be suitable for image classification, providing performance similar to that of networks trained explicitly for that task only.

Since the semantic target feature space is, by design, very specific to the classes present in the training set, generalization w.r.t. novel classes is still an issue. It thus seems promising to investigate the use of activations at earlier layers in the network, which we expect to be more general.


  • [1] A. Babenko and V. Lempitsky. Aggregating local deep features for image retrieval. In IEEE International Conference on Computer Vision (ICCV), pages 1269–1277, 2015.
  • [2] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In European Conference on Computer Vision (ECCV), pages 584–599. Springer, 2014.
  • [3] B. Barz and J. Denzler. Deep learning is not a matter of depth but of good training. In International Conference on Pattern Recognition and Artificial Intelligence (ICPRAI), pages 683–687. CENPARMI, Concordia University, Montreal, 2018.
  • [4] J. Y. Chang and K. M. Lee. Large margin learning of hierarchical semantic similarity for image classification. Computer Vision and Image Understanding, 132:3–11, 2015.
  • [5] W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 403–412, 2017.
  • [6] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 539–546. IEEE, 2005.
  • [7] J. Deng, A. C. Berg, and L. Fei-Fei. Hierarchical semantic indexing for large scale image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 785–792. IEEE, 2011.
  • [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.
  • [9] T. Deselaers, R. Paredes, E. Vidal, and H. Ney. Learning weighted distances for relevance feedback in image retrieval. In International Conference on Pattern Recognition (ICPR), pages 1–4. IEEE, 2008.
  • [10] C. Fellbaum. WordNet. Wiley Online Library, 1998.
  • [11] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems (NIPS), pages 2121–2129, 2013.
  • [12] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5927–5935, 2017.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [14] S. Horiguchi, D. Ikami, and K. Aizawa. Significance of softmax-based features in comparison to distance metric learning-based features. arXiv preprint arXiv:1712.10151, 2017.
  • [15] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [16] S. S. Husain and M. Bober. Improving large-scale image retrieval through robust aggregation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(9):1783–1796, 2017.
  • [17] H. Jégou and A. Zisserman. Triangulation embedding and democratic aggregation for image search. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3310–3317. IEEE, 2014.
  • [18] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
  • [20] D. Li, H.-Y. Lee, J.-B. Huang, S. Wang, and M.-H. Yang. Learning structured semantic embeddings for visual recognition. arXiv preprint arXiv:1706.01237, 2017.
  • [21] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017.
  • [22] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), pages 3111–3119, 2013.
  • [23] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (ICML), pages 1310–1318, 2013.
  • [24] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • [25] J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. arXiv preprint arXiv:1612.08242, 2016.
  • [26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [27] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, 2015.
  • [28] F. Setti. To know and to learn – about the integration of knowledge representation and deep learning for fine-grained visual categorization. In VISIGRAPP (5: VISAPP), pages 387–392, 2018.
  • [29] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR-WS), pages 806–813, 2014.
  • [30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [31] A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 22(12):1349–1380, 2000.
  • [32] X. Sun, B. Wei, X. Ren, and S. Ma. Label embedding network: Learning label representation for soft training of deep networks. arXiv preprint arXiv:1710.10393, 2017.
  • [33] N. Verma, D. Mahajan, S. Sellamanickam, and V. Nair. Learning hierarchical similarity metrics. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2280–2287. IEEE, 2012.
  • [34] K. Q. Weinberger and O. Chapelle. Large margin taxonomy embedding for document categorization. In Advances in Neural Information Processing Systems (NIPS), pages 1737–1744, 2009.
  • [35] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision (ECCV), pages 499–515. Springer, 2016.
  • [36] Z. Yan, H. Zhang, R. Piramuthu, V. Jagadeesh, D. DeCoste, W. Di, and Y. Yu. Hd-cnn: hierarchical deep convolutional neural networks for large scale visual recognition. In IEEE International Conference on Computer Vision (ICCV), pages 2740–2748, 2015.
  • [37] B. Zhao, F. Li, and E. P. Xing. Large-scale category structure aware image categorization. In Advances in Neural Information Processing Systems (NIPS), pages 1251–1259, 2011.
  • [38] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang. Random erasing data augmentation. arXiv preprint arXiv:1708.04896, 2017.

Appendix A The Dissimilarity d Applied to Trees Is a Metric

Theorem 1.

Let G = (V, E) be a directed acyclic graph whose edges E define a hyponymy relation between the semantic concepts in V. Furthermore, let G have exactly one unique root node with indegree 0; the lowest common subsumer of two concepts hence always exists. Moreover, let height(u) denote the maximum length of a path from a node u to a leaf node, and let C ⊆ V be a set of classes of interest.

Then, the semantic dissimilarity between classes given by

d(u, v) = height(lcs(u, v))

is a proper metric if

  1. G is a tree, i.e., all nodes have indegree ≤ 1, and

  2. all classes of interest are leaf nodes of the hierarchy, i.e., all u ∈ C have outdegree 0.


For d to be a proper metric, it must possess the following properties:

  1. Non-negativity: d(u, v) ≥ 0.

  2. Symmetry: d(u, v) = d(v, u).

  3. Identity of indiscernibles: d(u, v) = 0 ⇔ u = v.

  4. Triangle inequality: d(u, w) ≤ d(u, v) + d(v, w).

The conditions 1 and 2 are always satisfied, since d is defined as the length of a path, which cannot be negative, and the lowest common subsumer (LCS) of two nodes is independent of the order of its arguments.

The proof with respect to the remaining properties 3 and 4 can be conducted as follows:



Identity of indiscernibles: Let u, v ∈ C be two classes with d(u, v) = 0. This means that their LCS has height 0 and hence must be a leaf node. Because leaf nodes have, by definition, no further children, u = v. On the other hand, d(u, u) = 0 for any class u ∈ C, because lcs(u, u) = u and u is a leaf node according to 2.


Triangle inequality: Let u, v, w ∈ C be three classes. Due to 1, there exists exactly one unique path from the root of the hierarchy to any node. Hence, lcs(u, v) and lcs(v, w) both lie on the path from the root to v and they are, thus, either identical or one is an ancestor of the other. Without loss of generality, we assume that lcs(u, v) is an ancestor of lcs(v, w) and thus lies on the root-paths to u, v, and w. In particular, lcs(u, v) is a subsumer of u and w and, therefore, d(u, w) ≤ height(lcs(u, v)) = d(u, v). In general, it follows that d(u, w) ≤ max(d(u, v), d(v, w)) ≤ d(u, v) + d(v, w), given the non-negativity of d.
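The dissimilarity from Theorem 1 can be illustrated on a toy tree. The helper functions below are our own hypothetical sketch mirroring the definitions above; the hierarchy is encoded by a parent map (child → parent) and a children map.

```python
def ancestors(parent, u):
    """Path from u up to the root, u first (parent maps child -> parent)."""
    path = [u]
    while u in parent:
        u = parent[u]
        path.append(u)
    return path

def lcs(parent, u, v):
    """Lowest common subsumer: first ancestor of u that also subsumes v."""
    anc_v = set(ancestors(parent, v))
    return next(a for a in ancestors(parent, u) if a in anc_v)

def height(children, u):
    """Maximum length of a path from u down to a leaf node."""
    if not children.get(u):
        return 0
    return 1 + max(height(children, c) for c in children[u])

def dissimilarity(parent, children, u, v):
    """d(u, v) = height(lcs(u, v)), as in Theorem 1."""
    return height(children, lcs(parent, u, v))
```

On a tree whose classes of interest are leaves, d(u, u) = 0 and the (ultrametric) triangle inequality holds, in line with the proof above.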

Remark regarding the inversion

If d is a metric, all classes of interest must necessarily be leaf nodes, since d(u, u) = height(lcs(u, u)) = height(u) must vanish, which is the case only if u is a leaf node.

However, condition 1 is not necessary in general, since d may even be a metric for graphs that are not trees. An example is given in Fig. 4(a). Nevertheless, most such graphs violate the triangle inequality, like the example shown in Fig. 4(b).

(a) A non-tree hierarchy where d is a metric.
(b) A non-tree hierarchy where d violates the triangle inequality.
Figure 4: Examples for non-tree hierarchies.

Appendix B Further Quantitative Results

Figure 5: Hierarchical precision on NAB and ILSVRC 2012.
Method                                           Plain-11   ResNet-110w   PyramidNet   NAB      ILSVRC
Classification-based                             0.2078     0.4870        0.3643       0.0877   0.2184
Classification-based + L2 Norm                   0.2666     0.5305        0.4621       0.1047   0.2900
DeViSE                                           0.2879     0.5016        0.4131       –        –
Center Loss                                      0.4180     0.4153        0.3029       0.1468   0.1285
Label Embedding                                  0.2747     0.6202        0.5920       0.2515   0.2683
Semantic Embeddings (embedding only) [ours]      0.5660     0.5900        0.6642       0.5000   0.3037
Semantic Embeddings (embedding + class.) [ours]  0.5886     0.6107        0.6808       0.5748   0.4508
Table 3: Classical mean average precision (mAP) on all datasets. The best value per column is set in bold font. Obviously, optimizing for a classification criterion only does not necessarily lead to features that are suitable for semantic image retrieval.

Appendix C Qualitative Results on ILSVRC 2012

Figure 6: Comparison of a subset of the top 100 retrieval results using L2-normalized classification-based features and our hierarchy-based semantic features for 3 exemplary queries on ILSVRC 2012. Image captions specify the ground-truth classes of the images and the border color encodes the semantic similarity of that class to the class of the query image, with dark green being most similar and dark red being most dissimilar.
Image Classification-based Semantic Embeddings (ours)
1. Giant Panda (1.00) 1. Giant Panda (1.00)
2. American Black Bear (0.63) 2. Lesser Panda (0.89)
3. Ice Bear (0.63) 3. Colobus (0.58)
4. Gorilla (0.58) 4. American Black Bear (0.63)
5. Sloth Bear (0.63) 5. Guenon (0.58)
1. Great Grey Owl (1.00) 1. Great Grey Owl (1.00)
2. Sweatshirt (0.16) 2. Kite (0.79)
3. Bonnet (0.16) 3. Bald Eagle (0.79)
4. Guenon (0.42) 4. Vulture (0.79)
5. African Grey (0.63) 5. Ruffed Grouse (0.63)
1. Monarch (1.00) 1. Monarch (1.00)
2. Earthstar (0.26) 2. Cabbage Butterfly (0.84)
3. Coral Fungus (0.26) 3. Admiral (0.84)
4. Stinkhorn (0.26) 4. Sulphur Butterfly (0.84)
5. Admiral (0.84) 5. Lycaenid (0.84)
1. Ice Bear (1.00) 1. Ice Bear (1.00)
2. Arctic Fox (0.63) 2. Brown Bear (0.95)
3. White Wolf (0.63) 3. Sloth Bear (0.95)
4. Samoyed (0.63) 4. Arctic Fox (0.63)
5. Great Pyrenees (0.63) 5. American Black Bear (0.95)
1. Ice Cream (1.00) 1. Ice Cream (1.00)
2. Meat Loaf (0.63) 2. Ice Lolly (0.84)
3. Bakery (0.05) 3. Trifle (0.89)
4. Strawberry (0.32) 4. Chocolate Sauce (0.58)
5. Fig (0.32) 5. Plate (0.79)
1. Cocker Spaniel (1.00) 1. Cocker Spaniel (1.00)
2. Irish Setter (0.84) 2. Sussex Spaniel (0.89)
3. Sussex Spaniel (0.89) 3. Irish Setter (0.84)
4. Australien Terrier (0.79) 4. Welsh Springer Spaniel (0.89)
5. Clumber (0.89) 5. Golden Retriever (0.84)
Figure 7: Top 5 classes predicted for several example images by a ResNet-50 trained purely for classification and by our network trained with the combined embedding and classification objective, incorporating semantic information. The correct label for each image is underlined and the numbers in parentheses specify the semantic similarity of the predicted class and the correct class. It can be seen that class predictions made based on our hierarchy-based semantic embeddings are much more relevant and consistent.

Appendix D Low-dimensional Semantic Embeddings

Figure 8: Hierarchical precision of our method for learning image representations based on class embeddings with varying dimensionality, compared with the usual baselines.

As can be seen from the description of our algorithm for computing class embeddings in Section 3.2, an embedding space with as many dimensions as there are classes is required in general to find an embedding that reproduces the semantic similarities of the classes exactly. This can become problematic in settings with a high number of classes.

For such scenarios, we proposed in Section 3.3 a method for computing low-dimensional embeddings of arbitrary dimensionality that approximate the actual relationships among the classes. We experimented with this option on the NAB dataset to see how reducing the number of features affects our algorithm for learning image representations and the semantic retrieval performance.
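The core of such an eigendecomposition-based approximation can be sketched as follows, assuming a precomputed symmetric class similarity matrix S with S[i, j] being the semantic similarity of classes i and j. This is our illustrative re-implementation of the idea, not the authors' code; the dot products of the returned rows approximate the entries of S, exactly so if d equals the number of classes and S is positive semi-definite.

```python
import numpy as np

def low_dim_embeddings(S, d):
    """d-dimensional class embeddings whose pairwise dot products
    approximate the similarity matrix S: keep the eigenvectors of the
    d largest eigenvalues, scaled by the square roots of those
    eigenvalues (negative eigenvalues are clipped to zero)."""
    eigval, eigvec = np.linalg.eigh(S)      # eigenvalues in ascending order
    top = np.argsort(eigval)[::-1][:d]      # indices of the d largest
    return eigvec[:, top] * np.sqrt(np.clip(eigval[top], 0.0, None))
```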

The results in Fig. 8 show that obtaining low-dimensional class embeddings through eigendecomposition is a viable option for settings with a high number of classes. Though the performance is worse than with the full amount of required features, our method still performs better than the competitors with as few as 16 features. Our approach hence also allows obtaining very compact image descriptors, which is important when dealing with huge datasets.

Interestingly, the 256-dimensional approximation even gives slightly better results than the full embedding after the first 80 retrieved images. We attribute this to the fact that fewer features leave less room for overfitting, so that slightly lower-dimensional embeddings can generalize better in this scenario.

Appendix E Taxonomy used for CIFAR-100