Hierarchical Image Classification using Entailment Cone Embeddings

04/02/2020 ∙ by Ankit Dhall, et al. ∙ MIT ∙ ETH Zurich

Image classification has been studied extensively, but there has been limited work in using unconventional, external guidance other than traditional image-label pairs for training. We present a set of methods for leveraging information about the semantic hierarchy embedded in class labels. We first inject label-hierarchy knowledge into an arbitrary CNN-based classifier and empirically show that availability of such external semantic information in conjunction with the visual semantics from images boosts overall performance. Taking a step further in this direction, we model more explicitly the label-label and label-image interactions using order-preserving embeddings governed by both Euclidean and hyperbolic geometries, prevalent in natural language, and tailor them to hierarchical image classification and representation learning. We empirically validate all the models on the hierarchical ETHEC dataset.




1 Introduction

Figure 1: Hierarchy of labels from the ETHEC dataset across 4 levels: family (blue), sub-family (aqua), genus (brown) and species. For clarity, this visualisation depicts only the first 3 levels. The name of the family is displayed next to its sub-tree. Edges represent direct relations.

In deep learning, classification is typically performed by independently predicting class probabilities (e.g., using a linear-softmax layer) and selecting the highest-scoring label. Such an approach by default assumes mutually exclusive, unstructured labels. Contrary to this assumption, in many common datasets, labels have an underlying latent organization, potentially allowing hierarchical clustering into progressively more abstract concepts. Relatively few previous works use hierarchical information in the context of computer vision. Among them, in [RedmonYOLO9000] the label-hierarchy from WordNet [miller1995wordnet] is used to consolidate data across datasets. [deng2012hedging] show how to optimize the trade-off between accuracy and fine-grained-ness of the predicted label, but their proposed method only considers the semantic similarity and disregards visual similarity. [Samplawski2019] use relation graph information to improve performance over a strong baseline in a zero-shot learning setting.

Incorporating the hierarchy in the model should improve generalization on classes for which training data is scarce, by leveraging features shared among hierarchically-related classes, e.g. “truck” and “car” both have wheels, a feature tied to their shared superclass “vehicle”. As is the case with few-shot learning approaches, sharing information and parameters among the long tail of leaf labels helps overcome this data scarcity problem.

Uncovering the black-box model

If a human is tasked with classifying an image, the natural way to proceed is to identify the membership of the image to abstract labels and then move to more fine-grained labels. Even if an untrained eye cannot tell apart an Alaskan Malamute from a Siberian Husky, it is more likely to at least get the concept of “animal” and its sub-concept “dog” correct.

By using the label hierarchy to guide the classification models, we are able to bridge one gap between the ways machines and humans deal with visual understanding. Incorporating such auxiliary information improves explainability and interpretability of image understanding models.

Leveraging label-label interactions

Usually, image classifiers perform flat N-way classification solely by learning to discriminate between visual signals. These models capture the label-image interactions but do not use additional information available about the inter-label interaction that could boost performance and interpretability.

Long-tailed data distributions

Real-world data is commonly characterized by imbalance. Class labels form a hierarchy and can be viewed as a directed acyclic graph (DAG), where abstract labels have finer-grained descendants. Abstract levels have fewer labels and more images per label compared to their fine-grained descendants. The converse is true for fine-grained labels, resulting in a long-tailed data distribution. Shallow classifiers benefit from balanced datasets, and generalize worse when classes are imbalanced. We show that image classifiers can exploit information naturally shared across data from different levels and labels.

Figure 2: Long-tailedness is evident from the image distribution across labels from the 4 levels of our hierarchy: 6 family, 21 sub-family, 135 genus and 550 species. x-axis: number of images for a particular label; y-axis: label. Genus and species labels have been omitted for clarity.

Visual similarity does not imply semantic similarity

Visual models rely on image-based features to distinguish between different objects. But, often, semantically related classes might exhibit marked visual dissimilarity. Sometimes it might even be the case that the intra-class variance of visual features for a single label is larger than the inter-class variance (we show an example in the Appendix, Fig. 14). In such scenarios learned representations for two instances with different visual appearance would be coerced away from each other, indirectly affecting the image understanding capability of the model.

Figure 3: Sample images and their 4-level labels from the ETHEC dataset. The dataset consists of 47,978 butterfly specimens with 723 labels spread across 4 levels.

Labels with varying levels of abstraction may also be beneficial for further downstream tasks involving both natural language and computer vision such as image captioning, scene graph generation and visual-question answering (VQA). This work exploits semantic information available in the form of hierarchical labels. We show that visual models trained with such guidance outperform a hierarchy-agnostic model. We also show how these models can be more interpretable when using more explicit representations via embeddings for the task of image classification.

Our work

We propose and compare multiple approaches for incorporating hierarchical information in state-of-the-art CNN classifiers. To this end, we first compare baselines where the hierarchy is exploited in the loss function (hierarchical softmax, marginalization classifier), and then propose a set of embedding-based approaches where images and labels are embedded in a common space. These are more flexible as they allow for entailment prediction tasks and hierarchy-based retrieval. Our embeddings are based on entailment cones, which can be embedded both in Euclidean geometry and in hyperbolic geometry. We compare these and show that the hyperbolic case has empirical advantages over the Euclidean case, while being backed up by theoretical advantages.

We summarize our contributions: (1) applying order-preserving embeddings to image classification, where both images and labels are embedded in a common space that enforces transitivity, (2) providing a set of methods to incorporate entailment cones in CNN-based classifiers, including effective optimization techniques, (3) comparing entailment cones in different geometries (Euclidean and hyperbolic), highlighting their strengths and weaknesses, (4) comparing embedding-based approaches to non-embedding-based approaches, under uniform settings.

2 Related Work

Embedding-based models for text. One way to model semantic hierarchies is to use order-preserving embeddings, which enforce transitivity among hierarchically-related concepts by imposing a structure on the latent space. For instance, order-embeddings [vendrov2015order] learn hierarchical word embeddings on WordNet [miller1995wordnet]. As an alternative to common symmetric distances (e.g. Euclidean, Manhattan, or cosine), the work proposes an asymmetric distance resulting in the formation of a transitive embedding space as shown in Fig. 4. In contrast to distance-preserving embeddings, the order-preserving nature of order-embeddings ensures that anti-symmetric and transitive relations can be captured well without having to rely on physical closeness between points. However, the distance function in [vendrov2015order] is limited, as each concept occupies a large volume in the embedding space irrespective of its volume needs and suffers from heavy orthant intersections. This ill-effect is amplified in extremely low dimensions such as $\mathbb{R}^2$. To this end, [ganea2018entailment_cones] proposes Euclidean entailment cones, which generalize order-embeddings by substituting translated orthants with more flexible convex cones. Furthermore, [suzuki2019hyperbolic_disk] generalizes order-embeddings [vendrov2015order] and entailment cones [ganea2018entailment_cones] for embedding DAGs with an exponentially-increasing number of nodes.

More general and flexible methods where the embedding space is not necessarily Euclidean have also been explored. [ganea2018entailment_cones] leverage non-Euclidean geometry by learning embeddings defined by hyperbolic cones for hypernymy prediction in the WordNet hierarchy [miller1995wordnet]. In hyperbolic space, the volume of a ball grows exponentially with the radius, as compared to polynomially in Euclidean space, which allows exponentially-growing hierarchies to be embedded in a low-dimensional space. Lately, [le2019hearst_cones] combined Hearst patterns, used to create a graph, with hyperbolic embeddings to infer and embed hypernyms from text. Hyperbolic neural networks [ganea2018hyperbolicNN] are feed-forward neural networks parameterized in hyperbolic space that allow using hyperbolic embeddings for NLP tasks more naturally and boost the performance.

Other non-Euclidean embeddings include embeddings on surfaces, generalized multidimensional scaling on the sphere and probability embeddings [li2018smoothing, muzellec2018] which generalize point embeddings.

Embedding-based models for images. Visual-semantic embeddings, proposed in [faghri2017vse++], define a similarity measure instead of an explicit classification and return the closest concept in the embedding space for a given query. They use an LSTM and a CNN, map both to a joint embedding space through a linear mapping, and measure similarity for cross-modal image-caption retrieval. [barz2018hierarchy] map images onto class embeddings and use the dot product to measure similarity. A drawback of such an approach is that the label embeddings are fixed while the image embeddings are trained. The labels might be embedded properly, but they might not be arranged in a way that puts visually similar labels together. Furthermore, these approaches are based on Euclidean geometry.

In contrast to general CNNs for image classification, the work done in [frome2013devise] exploits unannotated text in addition to the image labels. They use embeddings and transfer knowledge from the text domain to a model for visual recognition and perform zero-shot classification on an extended ImageNet dataset.


Non-embedding-based approaches. While this work focuses on embedding-based approaches, there has also been work on incorporating label hierarchies in the model architecture or in the loss function. [kumar2017hierarchical, chen2018finegrained, deng2014large] discuss hierarchical approaches not based on the concept of order-preserving embeddings. While these approaches can effectively exploit label hierarchies to improve performance, their hierarchies are typically fixed, integrated in the architecture of the model, and tailored to one specific downstream task (e.g. classification). On the other hand, embedding-based approaches allow for flexible hierarchies and retrieval tasks using parent-child queries.

3 Background

(a) In OE, if $v$ is-a $u$, it lies within an orthant at $u$.
(b) In EC, if $v$ is-a $u$, it lies within a cone at $u$.
Figure 4: Comparing embedding space for OE and EC.

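Order-embeddings (OE). Order-embeddings [vendrov2015order] preserve the order between objects rather than distance. From a set of ordered pairs and unordered pairs, the goal is to determine if an arbitrary pair is ordered. They use a reversed product order on $\mathbb{R}^n_+$, $x \preceq y \iff \bigwedge_{i=1}^{n} x_i \ge y_i$, and approximate order-violation minimization with the loss

$$\mathcal{L} = \sum_{(u,v) \in P} E(f(u), f(v)) + \sum_{(u',v') \in N} \max\{0, \alpha - E(f(u'), f(v'))\} \qquad (1)$$

where $P$ and $N$ represent positive and negative edges respectively, $\alpha$ is a margin, and $f$ is a function that maps a concept to its embedding. $E$ is the energy that defines the severity of the order-violation for a given pair and is given by the squared norm of the coordinate-wise order violation between the two embeddings. According to the energy, $E = 0$ exactly when the pair satisfies the order. For positive pairs $(u, v)$ where $v$ is-a $u$, one would like embeddings such that $E(f(u), f(v)) = 0$. $v$ is-a $u$ implies that $v$ is a sub-concept of $u$.

Figure 5: Visualization of the label-hierarchy embedded using OE in 2 dimensions. Node colors - cyan: family, magenta: sub-family, yellow: genus. Last level omitted for clarity.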
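The order-violation energy and the margin loss described above can be sketched in a few lines. This is a minimal illustration, assuming the energy $E(x, y) = \|\max(0, y - x)\|^2$ from [vendrov2015order]; the function names, the argument order and the default margin are assumptions made for the example and are not taken from the paper.

```python
def order_violation(x, y):
    """Assumed energy E(x, y) = ||max(0, y - x)||^2.

    It is zero exactly when x >= y in every coordinate.
    """
    return sum(max(0.0, yi - xi) ** 2 for xi, yi in zip(x, y))


def hypernym_loss(positives, negatives, margin=1.0, energy=order_violation):
    """Margin loss: pull positive pairs to zero energy, push negatives above the margin."""
    pos = sum(energy(x, y) for x, y in positives)
    neg = sum(max(0.0, margin - energy(x, y)) for x, y in negatives)
    return pos + neg


specific = [2.0, 3.0]   # hypothetical embedding, larger in every coordinate
general = [1.0, 1.0]    # hypothetical embedding, closer to the origin
print(order_violation(specific, general))  # 0.0
print(order_violation(general, specific))  # 5.0
```

Passing a different `energy` function (for example a cone-based one) reuses the same loss, which mirrors how the text applies one loss to several embedding types.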

Euclidean Cones (EC). Euclidean cones [ganea2018entailment_cones] are a generalization of order-embeddings [vendrov2015order]. For each vector $x$ in $\mathbb{R}^n$, the aperture of the cone is based solely on the Euclidean norm of the vector, $\|x\|$ [ganea2018entailment_cones], and is given by $\psi(x) = \arcsin(K / \|x\|)$ where K is a hyper-parameter. The cones can have a maximum aperture of $\pi/2$ [ganea2018entailment_cones]. To ensure continuity and transitivity, the aperture should be a smooth, non-increasing function. To satisfy properties mentioned in [ganea2018entailment_cones], the domain of the aperture function has to be restricted to $[\epsilon, \infty)$ for some $\epsilon > 0$.

$$\Xi(x, y) = \arccos\left(\frac{\|y\|^2 - \|x\|^2 - \|x - y\|^2}{2\,\|x\|\,\|x - y\|}\right) \qquad (2)$$

$$E(x, y) = \max\{0,\; \Xi(x, y) - \psi(x)\}$$

Eq. 2 computes the minimum angle between the axis of the cone at $x$ and the vector $y$. $E(x, y)$ measures the cone-violation, which is the minimum angle required to rotate the axis of the cone at $x$ to bring $y$ into the cone.
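As a concrete illustration of the Euclidean cone violation, the sketch below follows the formulation in [ganea2018entailment_cones]: aperture $\arcsin(K/\|x\|)$, the angle at $x$ between the cone axis and $y - x$, and the energy as the positive part of their difference. The function names and the value `K=0.1` are assumptions chosen for the example.

```python
import math


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


def aperture(x, K=0.1):
    """Half-aperture of the cone at x; requires norm(x) >= K."""
    return math.asin(K / norm(x))


def axis_angle(x, y):
    """Angle at x between the cone axis (direction of x) and y - x; requires y != x."""
    diff = [yi - xi for xi, yi in zip(x, y)]
    dot = sum(xi * di for xi, di in zip(x, diff))
    cos = dot / (norm(x) * norm(diff))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error


def cone_energy(x, y, K=0.1):
    """Zero when y lies inside the cone at x, otherwise the angular violation."""
    return max(0.0, axis_angle(x, y) - aperture(x, K))


print(cone_energy([1.0, 0.0], [2.0, 0.0]))  # 0.0, y lies on the cone axis
print(cone_energy([1.0, 0.0], [1.0, 1.0]))  # positive, y is perpendicular to the axis
```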

Figure 6: Visualization of the label-hierarchy using Euclidean cones in 2 dimensions. Color coding follows Fig. 5. genus+species nodes are omitted to visualize better.

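Hyperbolic Cones (HC). The Poincaré ball is defined by the manifold $\mathbb{D}^n = \{x \in \mathbb{R}^n : \|x\| < 1\}$. The distance between two points and the norm are

$$d_{\mathbb{D}}(x, y) = \operatorname{arccosh}\left(1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right) \quad \text{and} \quad \|x\|_{\mathbb{D}} = d_{\mathbb{D}}(0, x) = 2 \operatorname{arctanh}(\|x\|),$$

where we use $\|\cdot\|$ for Euclidean norm, $\langle \cdot, \cdot \rangle$ for dot-product and $\hat{x}$ for a unit vector. The angle between two tangent vectors $u, v$ is given by $\cos\angle(u, v) = \langle u, v \rangle / (\|u\|\,\|v\|)$. The aperture of the cone is $\psi(x) = \arcsin\left(K (1 - \|x\|^2) / \|x\|\right)$.

$$\Xi(x, y) = \arccos\left(\frac{\langle x, y \rangle (1 + \|x\|^2) - \|x\|^2 (1 + \|y\|^2)}{\|x\|\,\|x - y\|\,\sqrt{1 + \|x\|^2 \|y\|^2 - 2 \langle x, y \rangle}}\right)$$

computes the minimum angle between the axis of the cone at $x$ and the vector $y$.

$$E(x, y) = \max\{0,\; \Xi(x, y) - \psi(x)\}$$

measures the cone-violation, which is the minimum angle required to rotate the axis of the cone at $x$ to bring $y$ into the cone.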

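Optimization in hyperbolic space. For parameters living in hyperbolic space, Riemannian stochastic gradient descent (RSGD) [ganea2018entailment_cones] is used. An update involves the Riemannian gradient (RG) for parameter $\theta$. RG is computed by rescaling the Euclidean gradient by $1/\lambda_\theta^2$ where $\lambda_\theta = 2 / (1 - \|\theta\|^2)$ [ganea2018entailment_cones]. The exponential map at a point $x$, $\exp_x(v)$, maps a point $v$ in the tangent space to the hyperbolic space:

$$\exp_x(v) = \frac{\lambda_x \left(\cosh(\lambda_x \|v\|) + \langle x, \hat{v} \rangle \sinh(\lambda_x \|v\|)\right) x + \frac{1}{\|v\|} \sinh(\lambda_x \|v\|)\, v}{1 + (\lambda_x - 1) \cosh(\lambda_x \|v\|) + \lambda_x \langle x, \hat{v} \rangle \sinh(\lambda_x \|v\|)}$$

where $\lambda_x = 2 / (1 - \|x\|^2)$ and $\hat{v} = v / \|v\|$.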

4 Approach

4.1 CNN classifiers

We do not focus on specifically designed CNN components but on different ways to formulate probability distributions to pass hierarchical information.

Hierarchy-agnostic baseline classifier (HAB). As a baseline, we use a SOTA residual network for image classification [he2016resnet]. The baseline is agnostic to any label hierarchy in the dataset. The model performs a single flat classification over the labels of all levels (see Fig. 9): the number of outputs is the total number of labels across all levels, i.e. the sum of the numbers of distinct labels on each level. It uses the one-versus-rest strategy for each of the labels. We minimize the multi-label soft-margin loss, computed on the logits (normalized as a probability distribution) from the last layer of the model for a given input image. From empirical analysis we found that choosing a single threshold for all labels is better, as it is less prone to over-fitting than choosing a per-class decision boundary. Refer to Appendix 7.4.
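To make the one-versus-rest objective concrete, here is a small sketch of a multi-label soft-margin loss (a per-label logistic loss averaged over the labels) together with a single shared decision threshold. The function names and the threshold value of 0.5 are assumptions for illustration; the threshold used in the paper is not stated in this section.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_soft_margin(logits, targets):
    """Mean over labels of the binary log-loss on sigmoid(logit)."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(logits)


def predict(logits, threshold=0.5):
    """Apply one shared threshold to every label instead of a per-class boundary."""
    return [int(sigmoid(z) >= threshold) for z in logits]


# One logit per label across all levels; several targets can be 1 at once.
print(multilabel_soft_margin([0.0, 0.0], [1, 0]))  # log(2), about 0.693
print(predict([2.0, -2.0]))                        # [1, 0]
```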

Per-level classifier (PLC). Instead of a single classifier over all labels, we use $L$ classifiers, one per level, where the $i$-th classifier handles all the labels present in level $i$ (Fig. 10). We use the multi-label soft-margin loss, summed over the levels and computed against the true label for the $i$-th level. The logits of the $i$-th classifier are a contiguous sub-sequence of the logits predicted by the last layer of the model.

Marginalization classifier (MC). The notion of levels is built into the per-level classifier, but it is still unaware of the relationship between nodes across levels. Here, a single classifier outputs a probability distribution over the final level in the hierarchy. Instead of having classifiers for the remaining levels, we compute the probability distribution over each one of these by summing the probabilities of the children nodes. Although the network does not explicitly predict these scores, the model is still penalized for incorrect predictions across the levels. We minimize a loss over all levels, computed against the true label for the $i$-th level from the logits of the last layer of the model. The probability of a non-leaf node is

$$p(v_j^i) = \sum_{c \,\in\, \mathrm{children}(v_j^i)} p(c)$$

where $v_j^i$ is the $j$-th vertex in the $i$-th level. All but the last level use this to compute the probabilities for their labels. For the final level, we compute the probabilities over the leaf nodes directly from the logits of the model by applying a softmax. Once the distribution over the leaves is determined, the distributions over the upper levels can be calculated in a bottom-up fashion as seen in Fig. 11.
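The bottom-up marginalization can be illustrated on a toy hierarchy. The sketch below assumes a softmax over the leaf logits and a hypothetical `parent_of` lookup; the node names and helper functions are made up for the example.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def marginalize(leaf_probs, parent_of, levels_above):
    """Return one distribution per level, leaf level first, by summing children."""
    dists = [dict(leaf_probs)]
    for _ in range(levels_above):
        upper = {}
        for node, p in dists[-1].items():
            parent = parent_of[node]
            upper[parent] = upper.get(parent, 0.0) + p
        dists.append(upper)
    return dists


# Toy hierarchy: family F -> genera A, B -> species a1, a2 (under A) and b1 (under B).
parent_of = {"a1": "A", "a2": "A", "b1": "B", "A": "F", "B": "F"}
leaves = ["a1", "a2", "b1"]
leaf_probs = dict(zip(leaves, softmax([0.0, 0.0, 0.0])))
species, genus, family = marginalize(leaf_probs, parent_of, levels_above=2)
print(genus)   # A gets two thirds of the mass, B one third
print(family)  # F gets all of the mass
```

A per-level loss can then be evaluated on each of the returned distributions, so an error at an upper level is penalized even though only leaf scores are predicted.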

Masked Per-level classifier (M-PLC). On the upper levels of the hierarchy one has more data per label and fewer labels to choose from. Naturally, this makes classifying relatively accurate closer to the root of the hierarchy. This model exploits knowledge about the parent-child relationship between nodes in a top down manner.

Here, we have $L$ classifiers, one for each level. For level $i$, the model's belief about the upper level $i-1$ is leveraged, i.e. its prediction for level $i-1$. Instead of naively predicting the label with the highest score for level $i$ (comparing among all possible logits), all nodes except the children of the predicted label for the previous level are masked (see Fig. 12). The label for level $i$ is the highest scoring unmasked node. The loss is computed over a subset of the original nodes for any level, which is possible due to the availability of the parent-child relationship. This assumes that the parent label is correct. Due to fewer labels and more data, classification in upper levels is more accurate, and since we perform this in a top-down fashion, this is a reasonable assumption. Another work has shown this to be the case [tasho2018thesis].

While training, even if the model predicts the parent incorrectly, we still use the ground truth to penalize its prediction for the children. For data with unknown ground truth, i.e. during evaluation, the model uses the predictions from level $i-1$ to infer about level $i$ by masking nodes that correspond to labels that are not possible as per the hierarchy. We minimize a loss summed over the levels, computed against the true label for the $i$-th level and restricted to the children of the ground-truth node on level $i-1$. As for the per-level classifier, the logits for level $i$ are a contiguous sub-sequence of the logits predicted by the last layer of the model.
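The top-down masking used at evaluation time can be sketched as follows. The data layout (one dictionary of logits per level and a hypothetical `children_of` lookup) is an assumption made for the example, not the paper's implementation.

```python
def masked_top_down(level_logits, children_of):
    """Predict one label per level, restricting each level to children of the previous pick."""
    first = level_logits[0]
    path = [max(first, key=first.get)]
    for logits in level_logits[1:]:
        allowed = children_of[path[-1]]          # every other node is masked out
        path.append(max(allowed, key=lambda node: logits[node]))
    return path


children_of = {"A": ["a1", "a2"], "B": ["b1"]}
level_logits = [
    {"A": 2.0, "B": 1.0},              # upper level: A is predicted
    {"a1": 0.1, "a2": 0.5, "b1": 3.0}  # lower level: b1 scores highest overall
]
print(masked_top_down(level_logits, children_of))  # ['A', 'a2']
```

Even though `b1` has the highest raw logit, it is masked because it is not a child of the predicted parent `A`, so the prediction stays consistent with the hierarchy.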

Hierarchical Softmax (HS). The HS model predicts logits for every node in the hierarchy. There are dedicated linear layers for each group of sibling nodes, leading to a separate (conditional) probability distribution over them. This is a probability conditioned on the parent node, i.e. the probability of each child given its parent, such that the conditional probabilities over the children of a node sum to one.

To reduce computation over large vocabularies, [morin2005hierarchical, mikolov2013hierarchical2] propose similar ideas for NLP. In the context of computer vision it is relatively unexplored, and we propose to predict conditional distributions for each set of direct descendants to exploit the label-hierarchy.

For each node, a vector holds the logits that exclusively correspond to all of its children. With this in place, for each set of children of a given node, a conditional probability distribution is output by the model, giving the conditional probability for every child node given the parent. In order to calculate the joint distribution over the leaves, the conditional probabilities along the path from the root to each leaf are multiplied, each factor being the probability of a node on the $(i+1)$-st level given its parent node on the $i$-th level.

The cross-entropy loss is computed only over the leaves, but since the distribution is calculated using internal nodes, all levels are optimized implicitly.
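The path-product computation can be made concrete with a toy tree. In this sketch each group of siblings gets its own softmax, and a leaf's probability is the product of the conditionals along its path from the root; the node names, the `children_of` lookup and the helper functions are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def leaf_probabilities(node_logits, children_of, root):
    """Multiply per-sibling-group conditionals along each root-to-leaf path."""
    probs = {}

    def walk(node, path_prob):
        kids = children_of.get(node, [])
        if not kids:
            probs[node] = path_prob
            return
        conditionals = softmax([node_logits[k] for k in kids])
        for kid, cond in zip(kids, conditionals):
            walk(kid, path_prob * cond)

    walk(root, 1.0)
    return probs


children_of = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1"]}
node_logits = {"A": 0.0, "B": 0.0, "a1": 0.0, "a2": 0.0, "b1": 5.0}
probs = leaf_probabilities(node_logits, children_of, "root")
print(probs)                   # a1 and a2 get a quarter each, b1 gets a half
print(-math.log(probs["a1"]))  # cross-entropy if a1 is the true leaf
```

Because the leaf probability depends on the logits of every ancestor group, the gradient of the leaf-level cross-entropy also reaches the internal nodes.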

4.2 Embedding Classifiers

We treat our label hierarchy as a directed acyclic graph, more specifically as a directed tree. The dataset consists of entailment relations $(u, v)$ connected via a directed edge from $u$ to $v$ (following the definition in [ganea2018entailment_cones]). These directed edges or hypernym links convey that $v$ is a sub-concept of $u$.

4.2.1 Label and Image Representations

Label embeddings. For our implementation of the HC, the label-embeddings live in the hyperbolic space and are optimized using the RSGD as per Fig. 6. RSGD is implemented by modifying the SGD gradients in PyTorch [pytorch] as it is not a part of the standard library.

Image embeddings. For images, features from the final layer of the backbone of the best performing CNN-based model are used. In order to map them to the embedding space we use a linear transform and then apply a projection into the Poincaré ball via the exponential map at zero, which is equivalent to $\exp_0(x) = \tanh(\|x\|)\, x / \|x\|$. This brings the image embeddings to the hyperbolic space with Euclidean parameters. This allows for optimizing the parameters with well-known optimization schemes such as Adam [kingma2014adam].
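A minimal sketch of this projection, assuming the exponential map at the origin of the Poincaré ball takes the form $\tanh(\|x\|)\, x / \|x\|$ as in [ganea2018entailment_cones]. The tiny matrix `W` and the feature vector are placeholders, not values from the paper.

```python
import math


def exp_map_zero(v):
    """Map a Euclidean vector into the open unit ball along its own direction."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        return list(v)
    scale = math.tanh(n) / n
    return [scale * x for x in v]


def embed_image(features, W):
    """Linear transform of CNN features followed by the projection into the ball."""
    euclidean = [sum(w * f for w, f in zip(row, features)) for row in W]
    return exp_map_zero(euclidean)


W = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for the learnable matrix
point = embed_image([3.0, 4.0], W)
print(math.sqrt(sum(x * x for x in point)))  # tanh(5), just below 1
```

Since the trainable parameters (`W` here) stay Euclidean and only the output lands in the ball, a standard optimizer can update them directly.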

4.2.2 Embedding Label-Hierarchy

We begin by learning to represent the taxonomical hierarchy alone. Considering only the label-hierarchy and momentarily excluding the images, we model this problem as hypernym prediction, where a hypernym pair $(u, v)$ represents two labels such that $v$ is-a $u$. Embeddings for the label-hierarchy with OE and EC are shown in Fig. 5 and Fig. 6.

Data splitting. We use the tree to form the “basic” edges for which the transitive closure can be fully recovered. If these edges are not present in the train set, the information about them is unrecoverable and therefore they are always included in the train set. Now, we randomly pick edges from the transitive closure [wiki:transitive_closure] minus the “basic” edges to form a set of “non-basic” edges. We use the “non-basic” edges to create val (5%) and test (5%) splits and a proportion of the rest are reserved for training.
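The distinction between "basic" and "non-basic" edges can be illustrated with a small transitive-closure computation. The toy edge list and function name are assumptions for the example; the random selection of non-basic edges into train, val and test is omitted.

```python
def transitive_closure(edges):
    """All (ancestor, descendant) pairs reachable by following directed edges."""
    children = {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)
    closure = set()

    def descend(root, node):
        for child in children.get(node, ()):
            if (root, child) not in closure:
                closure.add((root, child))
                descend(root, child)

    for u in list(children):
        descend(u, u)
    return closure


# Toy tree: family F -> genus G -> species s1, s2.
basic = {("F", "G"), ("G", "s1"), ("G", "s2")}
non_basic = transitive_closure(basic) - basic
print(sorted(non_basic))  # [('F', 's1'), ('F', 's2')]
```

The basic edges always go into the train set because the closure cannot be recovered without them, while the non-basic edges are the pool from which held-out pairs are drawn.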

Training details. We follow the training details in [ganea2018entailment_cones]. We augment both the validation and test set by 5 negative pairs each for every positive pair $(u, v)$: of the type $(u', v)$ and $(u, v')$, with a randomly chosen corrupted edge that is not present in the full transitive closure of the graph. This generates 10 negatives for each positive. We report performance on different training set sizes. We vary the training set to include 0%, 10%, 25%, 50% of the “non-basic” edges selected randomly. We train for 500 epochs with a batch size of 10. We run two sets of experiments: in the first, we fix the margin as mentioned in [vendrov2015order], and in the second, we tune it based on the F1-score on the val set [ganea2018entailment_cones].

Pick-per-level strategy. During the experiments, instead of sampling the corrupted node of a negative edge uniformly from all candidate nodes, we pick each corrupted node from a different level in the hierarchy. This serves a dual purpose. First, 78.24% of the nodes belong to the final level in the hierarchy, and uniform negative sampling would result in edges where the corrupted node is from the last level the majority of the time, making convergence slow. Secondly, this strategy samples hard negative edges from the same level as the non-corrupted node, helping embeddings to disentangle and spread out in space.

Optimization details. We use Adam optimizer [kingma2014adam] for order-embeddings and Euclidean cones. For hyperbolic cones we use RSGD [ganea2018entailment_cones]. We also embed synthetic trees of varying height and branching factor using OE and EC. The final embeddings are visualized in Fig. 13.

4.2.3 Jointly Embedding Images with Label-Hierarchy

In order-embeddings [vendrov2015order], the images are put on the lower level and the captions on the upper level, as images are more detailed while captions represent concepts more abstract than the image itself. For jointly embedding the images together with the labels we use the hypernym loss from Eq. 1. We modify it such that now, in addition to the labels, $G$ (the graph representing the hierarchy) also contains images as leaf nodes at the lowest level. $G$ consists of two types of edges: an edge $(u, v)$ either connects two labels or connects a label $u$ to an image $v$. The embeddings are computed differently for images and labels, but in the end both embedding functions map their respective inputs to the same space.

Multi-label Classification with Embeddings. Since our problem does not concern hypernym prediction but rather assigning multiple labels to an image, instead of performing edge prediction (as would be the case in a hypernym prediction task) we use the embeddings for the task of classification. To classify an image we compute the order-violation energy between the given image and each label and pick the label corresponding to the minimum violation.

Generating Label and Image Embeddings. To generate image embeddings we use the best performing CNN model trained on the ETHEC dataset and extract fc7-features from the penultimate layer. We use a learnable linear transformation, a matrix $W$, on top of the fc7-features to adjust them and map them into the joint embedding space. The weights of the CNN are frozen when calculating the fc7-features, so only $W$ can be learned. For the labels, the embedding function is just a lookup table that stores a vector for each label. The embeddings live in Euclidean space for Euclidean models and in hyperbolic space for hyperbolic models (Poincaré disk).
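Classification by minimum violation can be sketched as an argmin over the labels of each level. The example below assumes a Euclidean cone energy with the cone placed at the label embedding (so that an image inside a label's cone has zero energy); the embeddings, `K=0.1` and all names are invented for illustration.

```python
import math


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


def cone_energy(label, image, K=0.1):
    """Angular violation of `image` with respect to the cone at `label`."""
    diff = [b - a for a, b in zip(label, image)]
    cos = sum(a * d for a, d in zip(label, diff)) / (norm(label) * norm(diff))
    angle = math.acos(max(-1.0, min(1.0, cos)))
    return max(0.0, angle - math.asin(K / norm(label)))


def classify_per_level(image, labels_by_level, energy=cone_energy):
    """For every level, pick the label whose cone the image violates least."""
    return [
        min(level, key=lambda name: energy(level[name], image))
        for level in labels_by_level
    ]


labels_by_level = [
    {"A": [1.0, 0.0], "B": [0.0, 1.0]},    # hypothetical upper-level labels
    {"a1": [2.0, 0.0], "a2": [2.0, 1.0]},  # hypothetical lower-level labels
]
image = [3.0, 0.1]
print(classify_per_level(image, labels_by_level))  # ['A', 'a1']
```

Any other energy with the same signature (order-violation or hyperbolic cone) can be passed in place of `cone_energy`.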

Data splitting. We split the data the same way as for the CNN models: train (80%), val (10%) and test (10%) based solely on the images. The graph contains directed edges from each label to the image that it “describes” as well as edges between related labels.

Training details. Let $G$ represent the graph to be embedded. All edges in the transitive closure of $G$ are considered as positive edges. To obtain negative edges, a negative graph is constructed by removing the positive edges from a fully-connected di-graph with the same nodes as $G$.

While training, we generate negative pairs as mentioned in Section 4.2.2 with the pick-per-level strategy. We make sure that we do not sample a negative edge $(u', v')$ such that both $u'$ and $v'$ are images. This ensures that no two images are forced apart unless their labels require them to do so. For validation and testing, we measure the model’s classification performance on the val and test set images respectively.

Graph reconstruction task. In addition to the classification task, we also check the quality of reconstruction of the label-hierarchy itself. Here, all the edges in the transitive closure that correspond to edges between labels are treated as positive edges, while the edges in the negative graph that correspond to edges between labels are treated as negative edges. We compute the energy $E(f(u), f(v))$ for every such edge $(u, v)$ and choose a threshold on it, used to classify edges as positive or negative, that yields the best F1-score on this label-hierarchy reconstruction task. This task does not use any edges that have an image on either side.
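The threshold selection for the reconstruction task can be sketched as a search over the observed energies: an edge is predicted positive when its energy is at or below the threshold, and the threshold with the best F1-score is kept. The energies and labels below are made-up numbers, and the exhaustive search over observed values is an assumption about how such a threshold could be picked.

```python
def f1_at(energies, is_positive, threshold):
    """F1-score when edges with energy <= threshold are predicted positive."""
    tp = fp = fn = 0
    for e, pos in zip(energies, is_positive):
        predicted = e <= threshold
        if predicted and pos:
            tp += 1
        elif predicted and not pos:
            fp += 1
        elif pos:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(energies, is_positive):
    """Try every observed energy as a threshold and keep the best-scoring one."""
    return max(sorted(set(energies)), key=lambda t: f1_at(energies, is_positive, t))


energies = [0.0, 0.1, 0.4, 0.9, 1.5]
is_positive = [True, True, False, True, False]
t = best_threshold(energies, is_positive)
print(t, f1_at(energies, is_positive, t))  # 0.9 and 6/7, about 0.857
```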

For the image embedding function we use a linear transformation, a matrix $W$. Non-linearity is not applied to the output that maps to the embedding space.

Optimization details. For jointly embedding labels and images, we empirically found that using the Adam [kingma2014adam] optimizer works better than RSGD. The label embeddings are parameterized in the Euclidean space and we use the exponential map to map them to the hyperbolic space. This is observed to be more stable and helps the joint embeddings converge better. Also, with this implementation of the hyperbolic cones, for both labels and joint embeddings, it was not necessary to initialize the embeddings with the Poincaré embeddings [nickel2017poincare] as suggested in [ganea2018entailment_cones]. However, a performance boost is obtained when initialized with values from embedding only the label-hierarchy. EC: 200 epochs. HC: 100 epochs, initialization from the labels-only model. Optimizer: Adam.

5 Experiments

Data. We empirically evaluate our work on the real-world ETH Entomological Collection (ETHEC) dataset (https://www.research-collection.ethz.ch/handle/20.500.11850/365379) comprising images of Lepidoptera specimens with their taxonomy tree. The real-world dataset has variations not only in terms of the images per category but also a significant imbalance in the structure of the taxonomical tree. In Fig. 2 we illustrate the data distribution for each label in the ETHEC hierarchy.

Figure 7: Label-only embeddings with HC projected to 2D. The embeddings organize themselves such that more generic concepts are closer to the origin while the most specific concepts form the periphery. Color coding as in Fig. 5.

                    classify test set images     graph reconstruction
Model               m-F1     hit@3    hit@5      TPR      TNR      full-F1
Euclidean Cones
                    0.780    0.889    0.920      0.805    0.998    0.704
                    0.835    0.902    0.943      0.963    0.999    0.821
                    0.801    0.897    0.928      0.815    0.998    0.707
Hyperbolic Cones
                    0.840    0.920    0.939      0.642    0.998    0.576
                    0.805    0.902    0.928      0.523    0.997    0.483

Table 1: The table summarizes the embedding model performance when used to classify images for the ETHEC dataset. The joint image and label embeddings live in Euclidean or hyperbolic space. m-F1 is the critical metric for image classification performance. We also report the quality of the reconstruction for the label-hierarchy after the joint embedding.

5.1 Hierarchical Classification Performance

To perform image classification using embeddings, the label with the least violating energy for a given image, across all possible labels in a given level of the hierarchy, is taken as the predicted label. The CNN models are trained with Adam [kingma2014adam] for 100 epochs with 224 x 224 RGB images and a batch size of 64. Among the ResNet-50, 101 and 152 variants, we empirically found ResNet-50 to work best for HAB, PLC, MC and M-PLC, and ResNet-152 for HS.

Model        m-F1 (overall)   family    sub-family   genus     species
CNN-based methods
HAB          0.8147           0.9417    0.9446       0.8311    0.4578
PLC          0.9084           0.9766    0.9661       0.9204    0.7704
MC           0.9223           0.9887    0.9758       0.9273    0.7972
M-PLC        0.9173           0.9828    0.9701       0.9233    0.7930
HS           0.9180           0.9879    0.9731       0.9253    0.7855
Order-preserving (joint) embedding models
EC d=100     0.8350           0.9728    0.9370       0.8336    0.5967
HC d=100     0.7627           0.9695    0.9205       0.7523    0.4246
HC d=100     0.8404           0.9800    0.9439       0.8477    0.5977

Table 2: Both EC and HC exploit hierarchical information and outperform the hierarchy-agnostic classifier baseline. We include the overall m-F1 in addition to the separate m-F1 across the 4 levels in the ETHEC dataset. All joint-embeddings models are initialized using labels-only embeddings.

Table 2 shows that the hierarchy-agnostic baseline is outperformed by all models that use any kind of hierarchical information. Embeddings, a completely different class of models that is widely used in the context of natural language but relatively unexplored for image classification, also outperform HAB.

Figure 8: Jointly embedding labels and images using EC. Color coding follows Fig. 5, grey: images. The images are accumulated around the periphery, away from the origin.

W’s model capacity. We use a matrix $W$ that transforms fc7 image features to the embedding space. A more elaborate 4-layer feed-forward neural network was also tried but performed worse and was hard to optimize. Jointly training the complete CNN also led to over-fitting.

Negative edge frequency. When jointly embedding the ETHEC dataset, the images (around 50,000) outnumber the labels (723), so we tried sampling negative edges such that the corrupted nodes are split 50%:50% between images and labels. However, the original strategy works better.

Choice of Optimizer. Initial experiments for the hyperbolic cones (HC) used the RSGD optimizer, as it seemed to work for the labels-only hyperbolic cone embeddings. When using the same optimizer for the labels in the joint-embedding model, we noticed that the label hierarchy moves towards the image embeddings and ends up collapsing from a very good initialization (taken from the labels-only embeddings). The collapse leads to entanglement between nodes from different labels and images. From this point the model cannot recover, and the performance worsens because the label-hierarchy has become disarranged. We believe that the hierarchy fails to rearrange itself because two different types of objects are being embedded (and their embeddings are computed differently), an effect that is compounded by using different optimizers.

In our experiments, we obtain the best results with the Adam optimizer, even though the update step for parameters living in hyperbolic space then has to be performed in an approximate manner. In practice, Adam with an approximate update step works better than RSGD with its mathematically more precise update step.
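To make the contrast concrete, the sketch below shows one common form of an RSGD step on the Poincaré ball (rescaling the Euclidean gradient by the inverse metric, followed by a retraction and projection, as in Poincaré embeddings) next to a plain Euclidean-style step followed by projection. The second function is only an assumption about what an "approximate" update could look like; the exact approximation used with Adam is not specified here. All function names and the margin `EPS` are hypothetical.

```python
import math

EPS = 1e-5  # assumed margin that keeps points strictly inside the unit ball


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def project_to_ball(x, eps=EPS):
    """Rescale x so that its norm is at most 1 - eps."""
    n = norm(x)
    max_norm = 1.0 - eps
    if n >= max_norm:
        return [v * max_norm / n for v in x]
    return list(x)


def rsgd_step(x, euclid_grad, lr):
    """RSGD with retraction: rescale the gradient by (1 - ||x||^2)^2 / 4, step, project."""
    scale = (1.0 - sum(v * v for v in x)) ** 2 / 4.0
    stepped = [xi - lr * scale * gi for xi, gi in zip(x, euclid_grad)]
    return project_to_ball(stepped)


def approximate_step(x, update, lr):
    """Assumed approximation: apply an optimizer's Euclidean update (e.g. from Adam),
    ignoring the metric, then project back into the ball."""
    stepped = [xi - lr * ui for xi, ui in zip(x, update)]
    return project_to_ball(stepped)
```

Near the boundary of the ball the rescaling factor becomes very small, so the two steps can differ substantially in size even for the same gradient.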

Label initialization for joint-embeddings. Using RSGD, we observed that if the labels are not initialized with the labels-only embeddings, the joint model finds it difficult to disentangle the label embeddings. This effect eventually cascades to the images, and image classification performance fails to improve.

With RSGD replaced by Adam, randomly initialized label embeddings do disentangle and form entailment cones, even though the involvement of images makes the optimization more complex. The joint model still works well with random label initialization: it achieves an image classification m-F1 score of 0.7611 and even outperforms the hierarchy-agnostic CNN in m-F1. [ganea2018entailment_cones] recommends using Poincaré embeddings [nickel2017poincare] to initialize the hyperbolic cones model. It is interesting that both the joint model and the labels-only hyperbolic cones perform well without any special initialization scheme. We conjecture that this is due to the use of an approximate yet better optimizer.

6 Conclusion

We propose an embedding-based approach for image classification using entailment cones, a recently proposed type of order-preserving embeddings. In particular, we compare these both in the Euclidean geometry setting and in the hyperbolic setting, and show that hyperbolic geometry provides an empirical advantage over Euclidean geometry. We also propose and compare a set of simple hierarchical classifier baselines where the hierarchy is incorporated in the loss function. Although these tend to perform slightly better than embedding-based approaches, they are less flexible as they assume that the hierarchy is fixed, and are more limited in terms of downstream tasks (e.g. they do not allow for hierarchy-based retrieval). Finally, we evaluate our methods on the real-world ETHEC dataset, and show that exploiting hierarchical information always leads to an improvement over a shallow CNN classifier.


7 Appendix

7.1 Schematics for CNN-based models

Figure 9: Model schematic for the hierarchy-agnostic classifier. The model is a multi-label classifier and does not utilize any information about the presence of an explicit hierarchy in the labels.
Figure 10: Model schematic for the per-level classifier (= -way classifiers). The model uses information about the label-hierarchy by explicitly predicting a single label per level for a given image.
Figure 11: Model schematic for the Marginalization method. Instead of predicting a label per level, the model outputs a probability distribution over the leaves of the hierarchy. Probability for non-leaf nodes is determined by marginalizing over the direct descendants. The Marginalization method models how different nodes are connected among each other in addition to the fact that there are levels in the label-hierarchy.
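The marginalization idea in Figure 11 can be sketched as follows, assuming a toy hierarchy given as a child-to-parent map. Adding each leaf probability to all of its ancestors is equivalent to summing over direct descendants level by level. The function name and node names are hypothetical.

```python
def marginalize(leaf_probs, parent_of):
    """Derive probabilities for non-leaf nodes from a distribution over leaves.

    leaf_probs: dict mapping leaf label -> probability.
    parent_of:  dict mapping child label -> parent label (roots have no entry).
    """
    node_probs = dict(leaf_probs)
    for leaf, p in leaf_probs.items():
        node = parent_of.get(leaf)
        while node is not None:
            # Each ancestor accumulates the mass of every leaf below it.
            node_probs[node] = node_probs.get(node, 0.0) + p
            node = parent_of.get(node)
    return node_probs


# Toy hierarchy: two families, three leaves.
parent_of_toy = {"leaf_1": "family_A", "leaf_2": "family_A", "leaf_3": "family_B"}
probs = marginalize({"leaf_1": 0.5, "leaf_2": 0.25, "leaf_3": 0.25}, parent_of_toy)
```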
Figure 12: Model schematic for the Masked Per-level classifier. The model is trained exactly like the L -way classifier. While predicting, one assumes the model performs better for upper levels than lower levels. Keeping this in mind, when predicting a label for a lower level, the model’s prediction for the level above is used to mask all infeasible descendant nodes, assuming the model predicts correctly for the level above. This results in competition only among the descendants of the predicted label in the level above.
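A minimal sketch of the masking step described in Figure 12, under the assumption that per-level scores are available as dictionaries and the hierarchy as a parent-to-children map. The names `masked_predict` and `children_of`, as well as the toy scores, are hypothetical.

```python
def masked_predict(level_scores, children_of):
    """Predict one label per level, top level first.

    At each level below the first, only the children of the label predicted
    one level above may compete; all other labels are masked out.
    Assumes every predicted node has at least one scored child.
    """
    path = []
    allowed = None  # no mask at the top level
    for scores in level_scores:
        if allowed is None:
            candidates = scores
        else:
            candidates = {l: s for l, s in scores.items() if l in allowed}
        best = max(candidates, key=candidates.get)
        path.append(best)
        allowed = set(children_of.get(best, ()))
    return path


children_toy = {"A": ["a1", "a2"], "B": ["b1"]}
scores_toy = [{"A": 0.9, "B": 0.1}, {"a1": 0.2, "a2": 0.3, "b1": 0.5}]
prediction = masked_predict(scores_toy, children_toy)
```

In this toy example an unmasked second-level prediction would pick `b1`, which is inconsistent with the first-level prediction `A`; masking restricts the competition to `a1` and `a2`.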

7.2 Performance metrics

True positive rate

True positive rate (TPR) is the fraction of actual positives predicted correctly by the method.

True negative rate

True negative rate (TNR) is the fraction of actual negatives predicted correctly by the method.
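The two rates defined above correspond to the standard formulas below, written in terms of the counts of true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP). The notation is the usual textbook one and is an assumption about the symbols used here.

```latex
\mathrm{TPR} = \frac{TP}{TP + FN},
\qquad
\mathrm{TNR} = \frac{TN}{TN + FP}
```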


Precision computes what fraction of the labels predicted true by the model are actually true.


Recall computes what fraction of the true labels were predicted as true.


Here, the predictions for the i-th data sample are taken to be the set of its top-K predictions.
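One common set-based way to write precision and recall over top-K predictions is sketched below. The symbols are hypothetical: \hat{Y}_i^K stands for the set of top-K predictions for the i-th sample, Y_i for its set of true labels, and N for the number of samples. This is an illustrative formulation and not necessarily the exact one used in this work.

```latex
P = \frac{\sum_{i=1}^{N} \lvert \hat{Y}_i^K \cap Y_i \rvert}{\sum_{i=1}^{N} \lvert \hat{Y}_i^K \rvert},
\qquad
R = \frac{\sum_{i=1}^{N} \lvert \hat{Y}_i^K \cap Y_i \rvert}{\sum_{i=1}^{N} \lvert Y_i \rvert}
```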

Macro-averaged score

A macro-averaged score for a metric is calculated by averaging the metric across all labels.

Micro-averaged score

A micro-averaged score for a metric is calculated by accumulating contributions (to the performance metric) across all labels and these accumulated contributions are used to calculate the micro score.
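The difference between the two averaging schemes can be illustrated with precision on a toy multi-label problem. The helper names are hypothetical, and treating the precision of a label with no predictions as 0 is an assumed convention.

```python
def per_label_counts(true_sets, pred_sets, labels):
    """Count true positives and false positives separately for every label."""
    counts = {l: {"tp": 0, "fp": 0} for l in labels}
    for truth, pred in zip(true_sets, pred_sets):
        for l in pred:
            if l in truth:
                counts[l]["tp"] += 1
            else:
                counts[l]["fp"] += 1
    return counts


def precision(tp, fp):
    # Assumed convention: a label that is never predicted gets precision 0.
    return tp / (tp + fp) if (tp + fp) else 0.0


def macro_precision(counts):
    """Average the per-label precision across all labels."""
    return sum(precision(c["tp"], c["fp"]) for c in counts.values()) / len(counts)


def micro_precision(counts):
    """Accumulate the counts across all labels, then compute precision once."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    return precision(tp, fp)


true_toy = [{"a"}, {"a"}, {"a"}, {"b"}]
pred_toy = [{"a"}, {"a"}, {"a"}, {"a"}]
counts_toy = per_label_counts(true_toy, pred_toy, ["a", "b"])
```

With imbalanced labels the two scores diverge: the rare label `b` pulls the macro score down, while the micro score is dominated by the frequent label `a`.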

7.3 ETHEC dataset

The ETHEC dataset contains 47,978 images of the “order” Lepidoptera with corresponding labels across 4 different levels. According to the way the taxonomy is defined, the specific epithet (species) name associated with a specimen may not be unique. For instance, two samples with the label sets (Pieridae, Coliadinae, Colias, staudingeri) and (Lycaenidae, Polyommatinae, Cupido, staudingeri) have the same specific epithet but differ in all the other label levels: family, subfamily and genus. However, the combination of the genus and specific epithet is unique. To ensure that the hierarchy is a tree structure and each node has a unique parent, we define a version of the database with a 4-level hierarchy: family (6), subfamily (21), genus (135) and genus + specific epithet (561), for a total of 723 labels. We keep the genus level because, according to experts in the field, information about genera helps distinguish among samples and results in a better-performing model.
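The construction of unique leaf labels can be sketched as follows, using the two example samples from the text. The function names and the choice of a space as separator are assumptions; only the two label tuples and the per-level counts come from the description above.

```python
def leaf_label(genus, specific_epithet):
    """Combine genus and specific epithet into a single leaf label
    (the separator is an assumption)."""
    return f"{genus} {specific_epithet}"


def build_parent_map(samples):
    """Build a child -> parent map and fail if any node gets two parents."""
    parent_of = {}
    for family, subfamily, genus, epithet in samples:
        chain = [family, subfamily, genus, leaf_label(genus, epithet)]
        for parent, child in zip(chain, chain[1:]):
            # setdefault keeps an existing parent; a mismatch means no tree.
            if parent_of.setdefault(child, parent) != parent:
                raise ValueError(f"{child!r} has more than one parent")
    return parent_of


samples = [
    ("Pieridae", "Coliadinae", "Colias", "staudingeri"),
    ("Lycaenidae", "Polyommatinae", "Cupido", "staudingeri"),
]
parent_of = build_parent_map(samples)

# Number of labels per level, as stated in the text.
LEVEL_SIZES = {"family": 6, "subfamily": 21, "genus": 135,
               "genus + specific epithet": 561}
total_labels = sum(LEVEL_SIZES.values())
```

Using the bare epithet `staudingeri` as the leaf label would give it two parents (Colias and Cupido), which is exactly the case the combined label avoids.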

7.4 HAB details

Here we discuss the details of using a separate threshold for each label versus a common threshold for all labels in a multi-label classification setting. We report the maximum and minimum number of labels predicted by the multi-label model across the whole dataset, as well as the mean and standard deviation of the number of labels predicted.

7.4.1 Per-class decision boundary (PCDB) models

The ill-effects of such free rein are reflected in Table 3. Models with a high average number of predictions, especially the per-class decision boundary (PCDB) models, have high recall because they predict far more than just 4 labels for a given image. Predicting an image’s membership in many classes improves the chances of predicting the correct label, but at the cost of a large number of false positives. The (min, max) column clearly shows this reckless behavior: the worst-performing multi-label model in our experiments predicts a maximum of 718 labels for one sample and 451.14 ± 136.69 labels on average.

7.4.2 One-fits-all decision boundary (OFADB) models

The one-fits-all decision boundary (OFADB) performs better than the same model with per-class decision boundaries (PCDB). We believe that the OFADB prevents over-fitting, especially when many labels have very few data samples to learn from, which is the case for the ETHEC database. Here too, the multi-label setting allows the model to predict as many labels as it wants; however, there is a marked difference in the (min, max) and average statistics between the OFADB and PCDB models. The best-performing OFADB model predicts 3.10 ± 1.16 labels on average. This is close to the correct number of labels per specimen, which equals the 4 levels in the label hierarchy.
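The two thresholding schemes and the label-count statistics can be sketched as below. The scores and threshold values are toy placeholders, and the function names are hypothetical; how the thresholds are actually chosen is not shown here.

```python
import statistics


def predict_labels(scores, thresholds):
    """Return the indices of all labels whose score reaches its threshold."""
    return [i for i, (s, t) in enumerate(zip(scores, thresholds)) if s >= t]


def label_count_stats(all_scores, thresholds):
    """(min, max, mean, std) of the number of labels predicted per sample."""
    counts = [len(predict_labels(s, thresholds)) for s in all_scores]
    return min(counts), max(counts), statistics.mean(counts), statistics.pstdev(counts)


toy_scores = [[0.9, 0.2, 0.6],
              [0.1, 0.4, 0.7]]

# One-fits-all: the same threshold for every label.
ofadb_stats = label_count_stats(toy_scores, [0.5, 0.5, 0.5])

# Per-class: a separate threshold per label (toy values).
pcdb_stats = label_count_stats(toy_scores, [0.05, 0.1, 0.5])
```

In this toy case the low per-class thresholds make every label fire for every sample, mirroring on a small scale how per-class boundaries can inflate the number of predicted labels.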

7.4.3 Loss reweighing and Data re-sampling

Both data re-sampling and loss reweighing remedy imbalance across different labels, but via different paradigms. Instead of modifying what the model sees during training, reweighing the loss penalizes different data points differently. We use the inverse frequency of a label as the weight that scales the loss for data points belonging to that label.
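A minimal sketch of inverse-frequency loss weights, assuming one label per training sample for simplicity. The function name is hypothetical, and any normalization of the weights is left out.

```python
from collections import Counter


def inverse_frequency_weights(train_labels):
    """Weight for each label = 1 / (number of training samples with that label)."""
    counts = Counter(train_labels)
    return {label: 1.0 / count for label, count in counts.items()}


weights = inverse_frequency_weights(["a", "a", "a", "a", "b"])
# The loss of each sample would then be multiplied by weights[its label].
```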

Re-sampling involves choosing some samples multiple times while omitting others, by over-sampling and under-sampling. We wish to prevent the model from being biased by the population of data belonging to a particular label. We perform re-sampling based on the inverse frequency of a label in the train set. In our experiments, re-sampling significantly outperforms loss reweighing, confirming the observations made in [seiffert2008resampling].
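Inverse-frequency re-sampling can be sketched as drawing training indices with replacement, where each sample's probability is proportional to the inverse frequency of its label. The sketch assumes one label per sample; the function name and the fixed seed are hypothetical choices for illustration.

```python
import random
from collections import Counter


def resample_indices(train_labels, num_samples, seed=0):
    """Draw sample indices with replacement, weighted by inverse label frequency."""
    counts = Counter(train_labels)
    sample_weights = [1.0 / counts[label] for label in train_labels]
    rng = random.Random(seed)
    return rng.choices(range(len(train_labels)), weights=sample_weights, k=num_samples)


toy_labels = ["a"] * 9 + ["b"]
drawn = resample_indices(toy_labels, num_samples=10000)
fraction_b = sum(1 for i in drawn if toy_labels[i] == "b") / len(drawn)
```

Although `b` makes up only 10% of the toy training set, it is drawn about half of the time, so rare labels are over-sampled and frequent ones under-sampled.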

cw  rs  m-P     m-R     m-F1    (min, max)  mean ± std
ResNet-50 - Per-class decision boundary
        0.0355  0.7232  0.0677  (3, 351)    81.4 ± 69.5
        0.7159  0.7543  0.3718  (0, 13)     4.2 ± 2.1
        0.0077  0.8702  0.0153  (84, 718)   451.1 ± 136.7
        0.0081  0.7519  0.0161  (33, 714)   370.0 ± 120.6
ResNet-50 - One-fits-all decision boundary
        0.9324  0.7235  0.8147  (0, 7)      3.1 ± 1.2
        0.9500  0.6564  0.7763  (0, 5)      2.8 ± 0.6
        0.2488  0.2960  0.2704  (4, 9)      4.8 ± 0.8
        0.1966  0.3800  0.2591  (4, 10)     7.7 ± 0.6
Table 3: Performance metrics for the HAB on the ETHEC dataset. The models used in this experiment are pre-trained on the 1000-class ImageNet dataset. All weights are updated for 100 epochs with a learning rate of 0.01 and a batch size of 64; the input spatial dimensions are 224x224. P, R and F1 represent precision, recall and F1-score; cw and rs represent class weights and re-sampling. The prefix m denotes micro-averaged metrics. The top-performing models are in bold-face. Since the model can predict any number of labels (between 0 and the total number of labels), the table includes the minimum and maximum number of labels predicted (min, max) as well as the mean and standard deviation of the number of labels predicted. These statistics, like the rest, are calculated for samples in the test set.
(a) Order-embeddings L=4, b=3
(b) Order-embeddings L=3, b=7
(c) Euclidean cones L=4, b=3
(d) Euclidean cones L=3, b=7
Figure 13: We embed 2 different toy graphs: one with 4 levels and a branching factor of 4, and another with 3 levels and a branching factor of 7. The model is trained for 1000 epochs with Adam (learning rate of 0.01). The toy graphs are embedded using both order-embeddings and Euclidean cones. We draw an edge between each pair of nodes connected in the original graph in order to better visualize the embedding quality. Nodes from different levels are colored differently. The illustrations show the levels and branching factor. The edges are split into train, val and test sets, and we report F1-score, precision, recall and accuracy, as well as the threshold used to decide whether a pair of nodes has a directed edge or, equivalently, whether they are hypernyms.
(a) Aporia crataegi [ENT01_2017_03_27_007897]
(b) Parnassius stubbendorfii [ENT01_2018_03_09_132877]
(c) Parnassius delphius [ENT01_2018_03_09_133076]
(d) Parnassius delphius [ENT01_2018_03_09_133091]
Figure 14: Both semantic similarity and visual similarity are required to perform tasks relating to image understanding. Here, we see an example from the ETHEC dataset. At first glance, considering the visual similarities, (a) and (b) look like they belong to the same class, and so do (c) and (d). However, this is not so straightforward, as (a) and (b) belong to two separate genera and species but have a very low inter-class variance. On the other hand, (b), (c) and (d) all share the same genus Parnassius but have a larger intra-class variance than (a) and (b). This demonstrates how visual similarity might not imply semantic similarity and vice-versa.
(a) Hyperbolic Cones 100-D
(b) Hyperbolic Cones 1000-D
Figure 15: Projected visualization of labels embedded using hyperbolic cones in 100 and 1000 dimensions. The cyan nodes represent family, the magenta nodes sub-family, the yellow nodes genus and the black nodes genus+species. The embedding resembles a flower-like shape: the more generic concepts lie closer to the origin, at the base of the flower, while the most specific concepts lie at the tips of the petals. The latter form the periphery and are the most visible (black nodes).