Making Better Mistakes: Leveraging Class Hierarchies with Deep Networks

by   Luca Bertinetto, et al.
FiveAI Inc

Deep neural networks have improved image classification dramatically over the past decade, but have done so by focusing on performance measures that treat all classes other than the ground truth as equally wrong. This has led to a situation in which mistakes are less likely to be made than before, but are equally likely to be absurd or catastrophic when they do occur. Past works have recognised and tried to address this issue of mistake severity, often by using graph distances in class hierarchies, but this has largely been neglected since the advent of the current deep learning era in computer vision. In this paper, we aim to renew interest in this problem by reviewing past approaches and proposing two simple modifications of the cross-entropy loss which outperform the prior art under several metrics on two large datasets with complex class hierarchies: tieredImageNet and iNaturalist19.


1 Introduction

Image classification networks have improved greatly over recent years, but generalisation remains imperfect, and test-time errors do of course occur. Conventionally, such errors are defined with respect to a single ground-truth class and reported using one or more top-$k$ measures ($k$ typically set to $1$ or $5$). However, this practice imposes certain notions of what it means to make a mistake, including treating all classes other than the “true” label as equally wrong. This may not actually correspond to our intuitions about desired classifier behaviour, and for some applications this point may prove crucial. Take the example of an autonomous vehicle observing an object on the side of the road: whatever measure of classifier performance we use, we can certainly agree that mistaking a lamppost for a tree is less of a problem than mistaking a person for a tree, as such a mistake would have crucial implications in terms both of prediction and planning. If we want to take such considerations into account, we must incorporate a nontrivial model of the relationships between classes, and accordingly rethink more broadly what it means for a network to “make a mistake”. One natural and convenient way of representing these class relationships is through a taxonomic hierarchy tree.

Figure 1: Top-1 error and distribution of mistakes w.r.t. the WordNet hierarchy for well-known deep neural network architectures on ImageNet: see text for definition of mistake severity. The top-1 error has enjoyed a spectacular improvement in the last few years, but even though the number of mistakes has decreased in absolute terms, the severity of the mistakes made has remained fairly unchanged over the same period. The grey dashed lines denote the minimal possible values of the respective measures.

This idea is not new. In fact, it was once fairly common across various machine learning application domains to consider class hierarchy when designing classifiers, as surveyed in Silla & Freitas [silla2011survey]. That work assembled and categorised a large collection of hierarchical classification problems and algorithms, and suggested widely applicable measures for quantifying classifier performance in the context of a given class hierarchy. The authors noted that the hierarchy-informed classifiers of the era typically empirically outperformed “flat” (i.e. hierarchy-agnostic) classifiers even under standard metrics, with the performance gap increasing further under the suggested hierarchical metrics. Furthermore, class hierarchy is at the core of the ImageNet dataset: as detailed in Deng et al. [deng2009imagenet], it was constructed directly from WordNet [miller1998wordnet], itself a hierarchy originally designed solely to represent semantic relationships between words. Shortly after ImageNet’s introduction, works such as Deng et al. [deng2010does], Zhao et al. [zhao2011large], and Verma et al. [verma2012learning] explicitly noted that the underpinning WordNet hierarchy suggested a way of quantifying the severity of mistakes, and experimented with minimising hierarchical costs accordingly. Likewise, Deng et al. [deng2011hierarchical] presented a straightforward method for using a hierarchy-derived similarity matrix to define a more semantically meaningful compatibility function for image retrieval. Despite this initial surge of interest and the promising results accompanying it, the community shortly decided that hierarchical measures were not communicating substantially different information about classifier performance than top-$k$ measures¹, and effectively discarded them. When the celebrated results in Krizhevsky et al. [krizhevsky2012imagenet] were reported in flat top-$k$ terms only, the precedent was firmly set for the work which followed in the deep learning era of image classification. Interest in optimising hierarchical performance measures waned accordingly.

¹ From Russakovsky et al. [russakovsky2015imagenet]: “[..] we found that all three measures of error (top-5, top-1, and hierarchical) produced the same ordering of results. Thus, since ILSVRC2012 we have been exclusively using the top-5 metric which is the simplest and most suitable to the dataset.”

We argue here that this problem is ripe for revisitation, and we begin by pointing to Fig. 1. Here, a mistake is defined as a top-1 prediction which differs from the ground-truth class, and the severity of such a mistake is the height of the lowest common ancestor of the predicted and ground-truth classes in the hierarchy. We see that while the flat top-1 accuracies of state-of-the-art classifiers have improved to impressive levels over the years, the distributions of the severities of the errors that are made have changed very little over this time. We hypothesise that this is due, at least in part, to the scarcity of modern learning methods which attempt to exploit prior information and preferences about class relationships in the interest of “making better mistakes”, whether this information is sourced from an offline taxonomy or otherwise. The few exceptions of which we are aware include Frome et al. [frome2013devise], Wu et al. [wu2016learning], Barz & Denzler [barz2019hierarchy], and a passing mention in Redmon & Farhadi [redmon2017yolo9000]. In Sec. 2, we suggest a framework for thinking about these pieces of work, their predecessors, and some of their conceptual relatives.

The contributions of this work are as follows:

  1. We review relevant literature within an explanatory framework which unifies a fairly disjoint prior art.

  2. Building on the perspective gained from the preceding, we propose two methods that are both simple and effective at leveraging class hierarchies.

  3. We perform an extensive experimental evaluation to both demonstrate the effectiveness of the said methods compared to prior art and to encourage future work on the topic.

To ensure reproducibility, the PyTorch [paszke2019pytorch] code of all our experiments will be made available at

2 Framework and related work

We first suggest a simple framework for thinking about methods relevant to the problem of making better mistakes on image classification, beginning with the standard supervised setup. Consider a training set $\mathcal{S} = \{(x_i, C_i)\}_{i=1,\dots,N}$ which pairs $N$ images $x_i \in \mathcal{I}$ with class labels $C_i \in \mathcal{C}$. A network architecture implements the predictor function $\phi(x; \theta)$, whose parameters $\theta$ are learned by minimising the empirical risk

$$\frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(\phi(x_i; \theta),\, y(C_i)\big) + \mathcal{R}(\theta), \tag{1}$$

where the loss function $\mathcal{L}$ compares the predictor’s output $\phi(x_i; \theta)$ to an embedded representation $y(C_i)$ of each example’s class, and $\mathcal{R}$ is a regulariser.

Under common choices such as cross-entropy for $\mathcal{L}$ and one-hot embedding for $y$, it is easy to see that the framework is agnostic of relationships between classes. The question is how such class relationships, which we denote by $\mathcal{H}$, can be incorporated into the loss in Eqn. 1. We identify the following three approaches:

  1. Replacing class representation $y(C)$ with an alternate embedding $y^{\mathcal{H}}(C)$. Such “label-embedding” methods, discussed in Sec. 2.1, can draw their embedding both from taxonomic hierarchies and alternative sources.

  2. Altering the loss function $\mathcal{L}$ in terms of its arguments to produce $\mathcal{L}^{\mathcal{H}}\big(\phi(x; \theta), y(C)\big)$, i.e. making the penalty assigned to a given output distribution and embedded label dependent on $\mathcal{H}$. Methods using these “hierarchical losses” are covered in Sec. 2.2.

  3. Altering the function $\phi(x; \theta)$ to $\phi^{\mathcal{H}}(x; \theta)$, i.e. making hierarchically-informed architectural changes to the network, generally with the hope of introducing a favourable inductive bias. We cover these “hierarchical architectures” in Sec. 2.3.

While a regulariser $\mathcal{R}^{\mathcal{H}}$ is certainly feasible, it is curiously rare in practice: [zhao2011large] is the only example we know of.

2.1 Label-embedding methods

These methods map class labels to vectors whose relative locations represent semantic relationships, and optimise a loss on these embedded vectors. The DeViSE method of Frome et al. [frome2013devise] maps target classes onto a unit hypersphere via the method of Mikolov et al. [mikolov2013efficient], assigning terms with similar contexts to similar representations through analysis of unannotated Wikipedia text. The loss function is a ranking loss which penalises the extent to which the output is more cosine-similar to false label embeddings than to the correct one. They learn a linear mapping from a pre-trained visual feature pipeline (based on AlexNet) to the embedded labels, then fine-tune the visual pipeline itself through backpropagation. Romera-Paredes & Torr [romera2015embarrassingly] note that their one-line solution for learning an analogous linear mapping for zero-shot classification should easily extend to accommodating these sorts of embeddings. In Hinton et al. [hinton2015distilling], the role of the label embedding function is played by a temperature-scaled pre-existing classifier ensemble. This ensemble is “distilled” into a smaller DNN through cross-entropy minimisation against the ensemble’s output². For zero-shot classification, Xian et al. [xian2016latent] experiment with various independent embedding methods, as is also done in Akata et al. [akata2015evaluation]: annotated attributes, word2vec [mikolov2013distributed], GloVe [pennington2014glove], and the WordNet hierarchy. Their ranking loss function is functionally equivalent to that in Frome et al. [frome2013devise], and they learn a choice of linear mappings to these representations from the visual features output by a fixed CNN. Barz & Denzler [barz2019hierarchy] present an embedding algorithm which iteratively maps examples onto a hypersphere such that all cosine distances represent similarities derived from lowest common ancestor (LCA) height in a given hierarchy tree. They train a full deep model, and alternately present two different losses: (1) a linear loss based on cosine distance to the embedded class vectors, used for retrieval, and (2) the standard cross-entropy loss on the output of a fully connected/softmax layer added after the embedding layer, for classification.

² This refers to the “distillation” section of [hinton2015distilling]. There is a separate hierarchical classification section discussed later.

2.2 Hierarchical losses

In these methods, the loss function itself is parametrised by the class hierarchy such that a higher penalty is assigned to the prediction of a more distant relative of the true label. Deng et al. [deng2010does] simply train kNN- and SVM-based classifiers to minimise the expected WordNet LCA height directly. Zhao et al. [zhao2011large] modify standard multi-class logistic regression by recomputing output class probabilities as the normalised class-similarity-weighted sums of all class probabilities calculated as per the usual regression. They also regularise feature selection using an “overlapping-group-lasso penalty” which encourages the use of similar features for closely related classes, a rare example of a hierarchical regulariser. Verma et al. [verma2012learning] incorporate normalised LCA height into a “context-sensitive loss function” while learning a separate metric at each node in a taxonomy tree for nearest-neighbour classification. Wu et al. [wu2016learning] consider the specific problem of food classification. They begin with a standard deep network pipeline and add one fully connected/softmax layer to the end of the (shared) pipeline for each level of the hierarchy, so as to explicitly predict each example’s lineage. The loss is the sum of standard log losses over all of the levels, with a single user-specified coefficient for reweighting the loss of all levels other than the bottom one. Because this method is inherently susceptible to producing marginal probabilities that are inconsistent across levels, it incorporates a subsequent label-propagation smoothing step. Alsallakh et al. [alsallakh2017convolutional] likewise use a standard deep architecture as their starting point, but instead add branches strategically to intermediate pipeline stages. They thereby force the net to classify into offline-determined superclasses at the respective levels, backpropagating error in these intermediate predictions accordingly. On deployment, these additions are simply discarded.

2.3 Hierarchical architectures

These methods attempt to incorporate class hierarchy into the classifier architecture without necessarily changing the loss function otherwise. The core idea is to “divide and conquer” at the structural level, with the classifier assigning inputs to superclasses at earlier layers and making fine-grained distinctions at later ones. In the context of language models, it was noted at least as early as Goodman [goodman2001classes] that classification with respect to an IS-A hierarchy tree could be formulated as a tree of classifiers outputting conditional probabilities, with the product of the conditionals along a given leaf’s ancestry representing its posterior; motivated by efficiency, Morin & Bengio [morin2005hierarchical] applied this observation to a binary hierarchy derived from WordNet. Redmon & Farhadi [redmon2017yolo9000] propose a modern deep-learning variant of this framework in the design of the YOLOv2 object detection and classification system. Using a version of WordNet pruned into a tree, they effectively train a conditional classifier at every parent node in the tree by using one softmax layer per sibling group and training under the usual cross-entropy loss over leaf posteriors. While their main aim is to enable the integration of the COCO detection dataset with ImageNet, they suggest that graceful degradation on new or unknown object categories might be an incidental benefit. Brust & Denzler [brust2018integrating] propose an extension of conditional classifier chains to the more general case of DAGs.
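The tree-of-conditional-classifiers idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the YOLOv2 implementation: the toy hierarchy, the node names, the raw scores, and the `leaf_posteriors` helper are all hypothetical, and plain Python stands in for a network's output layer.

```python
import math

# Hypothetical toy hierarchy: parent -> children. Nodes without an entry are leaves.
CHILDREN = {
    "root": ["animal", "vehicle"],
    "animal": ["cat", "dog"],
    "vehicle": ["car", "bus"],
}

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def leaf_posteriors(logits, children=CHILDREN, root="root"):
    """Return {leaf: posterior}, given one raw score per non-root node.

    One softmax is applied per sibling group, yielding p(child | parent);
    a leaf's posterior is the product of the conditionals along its ancestry.
    """
    posteriors = {}
    stack = [(root, 1.0)]
    while stack:
        node, prob = stack.pop()
        kids = children.get(node)
        if not kids:  # leaf: the accumulated product is its posterior
            posteriors[node] = prob
            continue
        conditionals = softmax([logits[k] for k in kids])
        for kid, cond in zip(kids, conditionals):
            stack.append((kid, prob * cond))
    return posteriors

# Illustrative scores only; in practice these would come from the network.
logits = {"animal": 2.0, "vehicle": 0.0, "cat": 1.0, "dog": 1.0, "car": 0.5, "bus": -0.5}
post = leaf_posteriors(logits)
```

Because every sibling group is normalised separately, the leaf posteriors sum to one by construction, and the usual cross-entropy can be applied to them directly.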

The above approaches can be seen as a limiting case of hierarchical classification, in which every split in the hierarchy is cast as a separate classification problem. Many hierarchical classifiers fall between this extreme and that of flat classification, working in terms of a coarser-grained conditionality in which a “generalist” makes hard or soft assignments to groupings of the target classes before then distinguishing the group members from one another using “experts”. Xiao et al. [xiao2014error], the quasi-ensemble section of Hinton et al. [hinton2015distilling], Yan et al. [yan2015hd], and Ahmed et al. [ahmed2016network] all represent modern variations on this theme (which first appears no later than [jacobs1991adaptive]). Additionally, the listed methods all use some form of low-level feature sharing either via architectural constraint or parameter cloning, and all infer the visual hierarchy dynamically through confusion clustering or latent parameter inference. Alsallakh et al. [alsallakh2017convolutional] make the one proposal of which we are aware which combines hierarchical architectural modifications (at train time) with a hierarchical loss, as described in Sec. 2.2. At test time, however, the architecture is that of an unmodified AlexNet, and all superclass “assignment” is purely implicit.

3 Method

We now outline two simple methods that allow us to leverage class hierarchies in order to make better mistakes on image classification. We concentrate on the case where the output of the network is a categorical distribution over classes for each input image and denote the corresponding distribution as $p(C) = \phi_C(x; \theta)$, where subscripts denote vector indices and $x$ and $\theta$ are omitted for brevity. In Sec. 3.1, we describe the hierarchical cross-entropy (HXE), a straightforward example of the hierarchical losses reviewed in Sec. 2.2. This approach expands each class probability into the chain of conditional probabilities defined by its unique lineage in a given hierarchy tree. It then reweights the corresponding terms in the loss so as to penalise classification mistakes in a way that is informed by the hierarchy. In Sec. 3.2, we suggest an easy choice of embedding function to implement the label-embedding framework covered in Sec. 2.1. The resulting soft labels are PMFs over $\mathcal{C}$ whose values decay exponentially w.r.t. an LCA-based distance to the ground truth.

3.1 Hierarchical cross-entropy

When the hierarchy $\mathcal{H}$ is a tree, it corresponds to a unique factorisation of the categorical distribution $p(C)$ over classes in terms of the conditional probabilities along the path connecting each class to the root of the tree. Denoting the path from a leaf node $C$ to the root $R$ as $C^{(0)}, \dots, C^{(h)}$, the probability of class $C$ can be factorised as

$$p(C) = \prod_{l=0}^{h-1} p\big(C^{(l)} \mid C^{(l+1)}\big), \tag{2}$$

where $C^{(0)} = C$, $C^{(h)} = R$, and $h \equiv h(C)$ is the height of the node $C$. Note that we have omitted the last term $p(C^{(h)}) = p(R) = 1$. Conversely, the conditionals can be written in terms of the class probabilities as

$$p\big(C^{(l)} \mid C^{(l+1)}\big) = \frac{\sum_{A \in \mathrm{Leaves}(C^{(l)})} p(A)}{\sum_{B \in \mathrm{Leaves}(C^{(l+1)})} p(B)}, \tag{3}$$

where $\mathrm{Leaves}(C)$ denotes the set of leaf nodes of the subtree starting at node $C$.

A direct way to incorporate hierarchical information in the loss is to hierarchically factorise the output of the classifier according to Eqn. 2 and define the total loss as the reweighted sum of the cross-entropies of the conditional probabilities. This leads us to define the hierarchical cross-entropy (HXE) as

$$\mathcal{L}_{\mathrm{HXE}}(p, C) = -\sum_{l=0}^{h-1} \lambda\big(C^{(l)}\big) \log p\big(C^{(l)} \mid C^{(l+1)}\big), \tag{4}$$

where $\lambda(C^{(l)})$ is the weight associated with the edge $C^{(l+1)} \to C^{(l)}$, see Fig. 2a. Even though this loss is expressed in terms of conditional probabilities, it can be easily applied to models that output class probabilities using Eqn. 3. Note that $\mathcal{L}_{\mathrm{HXE}}$ reduces to the standard cross-entropy when all weights are equal to $1$. This limit case, which was briefly mentioned by Redmon & Farhadi in their YOLO-v2 paper [redmon2017yolo9000], results only in architectural changes but does not incorporate hierarchical information in the loss directly.
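The reduction to the standard cross-entropy can be checked in one line. Writing $C^{(0)} = C, \dots, C^{(h)} = R$ for the path from the ground-truth class to the root and $\lambda$ for the edge weights (notation assumed here for the sketch), setting every weight to one turns the sum of log-conditionals into the log of their product, which is the class probability itself by the factorisation along the path:

$$\mathcal{L}_{\mathrm{HXE}}(p, C)\Big|_{\lambda \equiv 1} = -\sum_{l=0}^{h-1} \log p\big(C^{(l)} \mid C^{(l+1)}\big) = -\log \prod_{l=0}^{h-1} p\big(C^{(l)} \mid C^{(l+1)}\big) = -\log p(C).$$

With unequal weights the sum no longer telescopes into a single class probability, which is how the hierarchy enters the loss.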

Eqn. 4 has an interesting information-theoretical interpretation: since each term $\log p(C^{(l)} \mid C^{(l+1)})$ corresponds to the information associated with the edge $C^{(l+1)} \to C^{(l)}$ in the hierarchy, the HXE corresponds to discounting the information associated with each of these edges differently. Note that since the HXE is expressed in terms of conditional probabilities, the reweighting in Eqn. 4 is not equivalent to reweighting the cross-entropy for each possible ground truth class independently (as done, for instance, in [lin2017focal, cui2019class]).

A sensible choice for the weights is to take

$$\lambda(C) = \exp\big(-\alpha\, h(C)\big), \tag{5}$$

where $h(C)$ is the height of node $C$ and $\alpha > 0$ is a hyperparameter that controls the extent to which information is discounted down the hierarchy. The higher the value of $\alpha$, the higher the preference for “generic” as opposed to “fine-grained” information, because classification errors related to nodes further away from the root receive a lower loss. While such a definition has the advantage of simplicity, one could think of other meaningful weightings, such as ones depending on the branching factor of the tree or encoding a preference towards specific classes. We concentrate on Eqn. 5 here, as it is both simple and easily interpretable, while leaving a more systematic exploration of different weighting strategies for future work.
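A minimal sketch of the HXE with exponential weights may help make the definitions concrete. It is not the authors' released code. The three-leaf tree, the node names, and the helpers `path_to_root`, `subtree_mass` and `hxe_loss` are hypothetical, and the sketch assumes that $h$ counts the edges between a node and the root, so that edges further from the root are discounted more strongly, as the text describes.

```python
import math

# Hypothetical toy tree, stored as child -> parent; "R" is the root.
PARENT = {"D": "R", "C": "R", "A": "D", "B": "D"}
LEAVES = ["A", "B", "C"]

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def subtree_mass(p, node):
    """Total probability of the leaves lying under `node` (a leaf counts itself)."""
    return sum(p[leaf] for leaf in LEAVES if node in path_to_root(leaf))

def hxe_loss(p, target, alpha):
    """Weighted sum of -log conditionals along the path from `target` to the root."""
    path = path_to_root(target)
    loss = 0.0
    for child, parent in zip(path[:-1], path[1:]):
        # Conditional p(child | parent) as a ratio of subtree masses.
        cond = subtree_mass(p, child) / subtree_mass(p, parent)
        # Assumption: h(child) = number of edges between child and root.
        h = len(path_to_root(child)) - 1
        loss -= math.exp(-alpha * h) * math.log(cond)
    return loss

# Two predictions that give the true class "A" the same probability (0.1):
near = {"A": 0.1, "B": 0.8, "C": 0.1}  # mass on the sibling "B"
far = {"A": 0.1, "B": 0.1, "C": 0.8}   # mass on the distant class "C"
loss_near = hxe_loss(near, "A", alpha=0.5)
loss_far = hxe_loss(far, "A", alpha=0.5)
```

Under the standard cross-entropy both predictions cost $-\log 0.1$, whereas here `loss_near` is smaller than `loss_far`, so confusing the target with its sibling is penalised less than confusing it with a class in a different branch.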

3.2 Soft labels

Our second approach to incorporating hierarchical information, soft labels, is a label-embedding approach as described in Sec. 2.1. These methods use a mapping function $y(C)$ to associate classes with representations which encode class-relationship information that is absent in the trivial case of the one-hot representation. In the interest of simplicity, we choose a mapping function $y^{\mathrm{soft}}(C)$ which outputs a categorical distribution over the classes. This enables us to simply use the standard cross-entropy loss:

$$\mathcal{L}_{\mathrm{Soft}}(p, C) = -\sum_{A \in \mathcal{C}} y_A^{\mathrm{soft}}(C) \log p(A), \tag{6}$$

where the soft label embedding is given componentwise by

$$y_A^{\mathrm{soft}}(C) = \frac{\exp\big(-\beta\, d(A, C)\big)}{\sum_{B \in \mathcal{C}} \exp\big(-\beta\, d(B, C)\big)}, \tag{7}$$

for class distance function $d$ and parameter $\beta$. This loss is illustrated in Fig. 2b. For the distance function $d(C_i, C_j)$, we use the height of LCA$(C_i, C_j)$ divided by the height of the tree. To understand the role of the hyperparameter $\beta$, note that values of $\beta$ that are much bigger than the typical inverse distance in the tree result in a label distribution that is nearly one-hot, i.e. $y_A^{\mathrm{soft}}(C) \simeq \delta_{AC}$, in which case the cross-entropy reduces to the familiar single-term log-loss expression. Conversely, for very small values of $\beta$ the label distribution is near-uniform. Between these extremes, greater probability mass is assigned to classes more closely related to the ground truth, with the magnitude of the difference controlled by $\beta$.
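The soft-label construction can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the toy tree and the helper names are hypothetical, and node height is taken here to be the number of edges on the longest path from a node down to a leaf, so that a leaf has height 0 and the distance of a class to itself is 0.

```python
import math

# Hypothetical toy tree, stored as child -> parent; "R" is the root.
PARENT = {"D": "R", "C": "R", "A": "D", "B": "D"}
LEAVES = ["A", "B", "C"]

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def height(node):
    """Edges on the longest downward path from `node` to a leaf (assumed convention)."""
    kids = [c for c, parent in PARENT.items() if parent == node]
    return 0 if not kids else 1 + max(height(k) for k in kids)

def lca(a, b):
    """Lowest common ancestor: first node on b's path that is also on a's path."""
    on_a_path = set(path_to_root(a))
    return next(n for n in path_to_root(b) if n in on_a_path)

def distance(a, b):
    """LCA height normalised by the height of the whole tree."""
    return height(lca(a, b)) / height("R")

def soft_label(target, beta):
    """Softmax of -beta * distance to the target, over all leaf classes."""
    scores = {a: math.exp(-beta * distance(a, target)) for a in LEAVES}
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}

def soft_label_loss(p, target, beta):
    """Standard cross-entropy between the soft label and the prediction p."""
    y = soft_label(target, beta)
    return -sum(y[a] * math.log(p[a]) for a in LEAVES)

y = soft_label("A", beta=4.0)
```

For the target "A", the sibling "B" receives more label mass than the unrelated class "C". Setting `beta` to 0 gives the uniform distribution, and a large `beta` gives a label that is numerically indistinguishable from one-hot.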

We offer two complementary interpretations that motivate this representation (besides its ease). For one, the distribution describing each target class can be considered to be a model of the actual uncertainty that a labeller (e.g. human) would experience due to visual confusion between closely related classes. It could also be thought of as encoding the extent to which a common response to different classes is required of the classifier, i.e. the imposition of correlations between outputs, where higher correlations are expected for more closely related classes. This in turn suggests a connection to the superficially different but conceptually related distillation method of Hinton et al[hinton2015distilling], in which correlations between a large network’s responses to different classes are mimicked by a smaller network to desirable effect. Here, we simply supply these correlations directly, using widely available hierarchies.

Another important connection is the one to the technique of label smoothing [szegedy2016rethinking], in which the “peaky” distribution represented by a one-hot label is combined with the uniform distribution. This technique has been widely adopted to regularise the training of large neural networks (e.g. [szegedy2016rethinking, chorowski2017towards, vaswani2017attention, zoph2018learning]), but has only very recently [muller2019does] been studied more thoroughly.
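For comparison with the soft labels of this section, label smoothing can be written as a label embedding of the same kind. With $K$ classes, ground truth $C$, and a smoothing coefficient $\epsilon \in [0, 1]$ (symbols assumed here for the sketch), the smoothed label is

$$y_A^{\mathrm{LS}}(C) = (1 - \epsilon)\, \delta_{AC} + \frac{\epsilon}{K}.$$

The mass $\epsilon$ is spread evenly over all classes regardless of their relation to $C$, whereas a hierarchy-derived soft label concentrates the redistributed mass on the classes closest to the ground truth.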

Figure 2: Representations of the HXE (Sec. 3.1) and soft labels (Sec. 3.2) losses for a simple illustrative hierarchy are drawn in subfigures (a) and (b) respectively. The ground-truth class is underlined, and the edges contributing to the total value of the loss are drawn in bold.

4 Evaluation

In the following, we first describe the datasets (Sec. 4.1) and metrics (Sec. 4.2) comprising the setup common to all of our experiments. Then, in Sec. 4.3, we empirically evaluate our two simple proposals and compare them to the prior art. Finally, we experiment with random hierarchies to understand when and how information on class relatedness can help classification.

4.1 Datasets

In our experiments, we use tieredImageNet [ren2018meta] (a large subset of ImageNet/ILSVRC’12 [russakovsky2015imagenet]) and iNaturalist’19 [van2018inaturalist], two datasets with hierarchies that are a) significantly different from one another and b) complex enough to cover a large number of visual concepts. ImageNet aims to populate the WordNet [miller1998wordnet] hierarchy of nouns, with WordNet itself generated by inspecting IS-A lexical relationships. By contrast, iNaturalist’19 [van2018inaturalist] has a biological taxonomy [ruggiero2015higher] at its core.

tieredImageNet was originally introduced by Ren et al. [ren2018meta] for the problem of few-shot classification, in which the sets of classes between dataset splits are disjoint. The authors’ motivation in creating the dataset was to use the WordNet hierarchy to generate splits containing significantly different classes, facilitating better assessment of few-shot classifiers by enforcing problem difficulty.

Although our task and motivations are different, we chose this dataset because of the large portion of the WordNet hierarchy spanned by its classes. To make it suitable for the problem of (standard) image classification, we re-sampled the dataset so as to represent all classes across the train, validation, and test splits. Moreover, since the method proposed in Section 3.1 and YOLO-v2 [redmon2017yolo9000] require that the graph representing the hierarchy is a tree, we modified the graph of the spanned WordNet hierarchy slightly to comply with this assumption (more details available in Appendix A). After this procedure, we obtained a tree of height and images from different classes, which we randomly assigned to training, validation, and test splits with respective probabilities , , and . We refer to this modified version of tieredImageNet as tieredImageNet-H.

iNaturalist is a dataset of images of organisms that has so far mainly been used to evaluate fine-grained visual categorisation methods. The dataset construction protocol differs significantly from the one used for ImageNet in that it relies on passionate volunteers instead of workers paid per task [van2018inaturalist]. Importantly, for the 2019 edition of the CVPR Fine-Grained Visual Categorization Workshop, metadata with hierarchical relationships between species have been released. In contrast to WordNet, this taxonomy is an 8-level complete tree that can readily be used in our experiments without modifications. Since the labels for the test set are not public, we randomly re-sampled three splits from the total of images from classes, again with probabilities , , and for the training, validation, and test set, respectively. We refer to this modified version of iNaturalist’19 as iNaturalist19-H.

4.2 Metrics

We consider three measures of performance, covering different interpretations of a classifier’s mistakes.

Top-$k$ error. Under this measure, an example is defined as correctly classified if the ground truth is among the top $k$ classes with the highest likelihood. This is the measure normally used to compare classifiers, usually with $k = 1$ or $k = 5$. Note that this measure considers all mistakes of the classifier equally, irrespective of how “similar” the predicted class is to the ground truth.

Hierarchical measures. We also consider measures that, in contrast to the top-$k$ error, do weight the severity of mistakes. We use the height of the lowest common ancestor (LCA) between the predicted class and the ground truth as a core severity measure, as originally proposed in the papers describing the creation of ImageNet [deng2009imagenet, deng2010does]. As remarked in [deng2010does], this measure should be thought of in logarithmic terms, as the number of confounded classes is exponential in the height of the ancestor. We also experimented with the Jiang-Conrath distance as suggested by Deselaers & Ferrari [deselaers2011visual], but did not observe meaningful differences w.r.t. the height of the LCA.

We consider two measures that utilise the height of the LCA between nodes in the hierarchy.

  • The hierarchical distance of a mistake is the height of the LCA between the ground truth and the predicted class when the input is misclassified, i.e. when the class with the maximum likelihood is incorrect. Hence, it measures the severity of misclassification when only a single class can be considered as a prediction.

  • The average hierarchical distance of top-$k$, instead, takes the mean LCA height between the ground truth and each of the $k$ most likely classes. This measure is important, for example, when multiple hypotheses of a classifier can be considered for a certain downstream task.
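The two hierarchical measures above can be sketched for a single example as follows. The toy hierarchy, the scores, and the function names are hypothetical, and node height is assumed to be the number of edges on the longest path from a node down to a leaf; in an evaluation these per-example values would be averaged over a dataset.

```python
# Hypothetical toy hierarchy, stored as child -> parent.
PARENT = {
    "cat": "animal", "dog": "animal",
    "car": "vehicle", "bus": "vehicle",
    "animal": "root", "vehicle": "root",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def height(node):
    """Edges on the longest downward path from `node` to a leaf (assumed convention)."""
    kids = [c for c, parent in PARENT.items() if parent == node]
    return 0 if not kids else 1 + max(height(k) for k in kids)

def lca_height(a, b):
    """Height of the lowest common ancestor of classes a and b."""
    on_a_path = set(path_to_root(a))
    return height(next(n for n in path_to_root(b) if n in on_a_path))

def hier_dist_mistake(scores, truth):
    """LCA height between the top-1 prediction and the truth; None if correct."""
    pred = max(scores, key=scores.get)
    return None if pred == truth else lca_height(pred, truth)

def avg_hier_dist_topk(scores, truth, k):
    """Mean LCA height between the truth and each of the k highest-scoring classes."""
    topk = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(lca_height(c, truth) for c in topk) / k

# Illustrative classifier output for one image whose true class is "cat".
scores = {"cat": 0.15, "dog": 0.6, "car": 0.2, "bus": 0.05}
severity = hier_dist_mistake(scores, "cat")    # "dog" is predicted: LCA is "animal"
avg_top2 = avg_hier_dist_topk(scores, "cat", 2)
```

Note that the first measure is only defined on misclassified examples, while the second is computed for every example and includes a zero term whenever the ground truth is among the top $k$.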

4.3 Experimental results

In the following, we analyse the performance of the two approaches described in Sec. 3.1 and Sec. 3.2, which we denote by HXE and soft labels, respectively. Besides a vanilla cross-entropy-based flat classifier, we also implemented and compared against the methods proposed by Redmon & Farhadi [redmon2017yolo9000] (YOLO-v2)³, Frome et al. [frome2013devise] (DeViSE), and Barz & Denzler [barz2019hierarchy]. As mentioned in Sec. 1, these methods represent, to the best of our knowledge, the only modern attempts to deliberately reduce the semantic severity of a classifier’s mistakes that are generally applicable to any modern architecture. Note, though, that we do not run DeViSE on iNaturalist19-H, as the class IDs of this dataset are alien to the corpus used by word2vec [mikolov2013efficient].

³ Note that this refers to the conditional classifier subsystem proposed in Sec. 4 of that work, not the main object detection system.

Implementation details. Since we are interested in understanding the mechanisms by which the above metrics can be improved, it is essential to use a simple configuration that is common between all of the algorithms taken into account.

We use a ResNet-18 architecture (with weights pretrained on ImageNet) trained with Adam [reddi2019convergence] for steps and mini-batches of size . We use a learning rate of unless specified otherwise. To prevent overfitting, we adopt PyTorch’s basic data augmentation routines with default hyperparameters: RandomHorizontalFlip() and RandomResizedCrop().

Further implementation details for all of the methods are deferred to Appendix B.

Figure 3: Top-1 error vs. hierarchical distance of mistakes, for tieredImageNet-H (top) and iNaturalist19-H (bottom). Points closer to the bottom-left corner of the plot are the ones achieving the best tradeoff.
Figure 4: Top-1 error vs. average hierarchical distance of top-$k$ (with $k \in \{1, 5, 20\}$) for tieredImageNet-H (top three) and iNaturalist19-H (bottom three). Points closer to the bottom-left corner of the plot are the ones achieving the best tradeoff.

Main results. In Fig. 3 and 4 we show how it is possible to effectively trade off top-1 error to reduce hierarchical error, by simply adjusting the hyperparameters $\alpha$ and $\beta$ in Eqn. 5 and 7. Specifically, increasing $\alpha$ corresponds to (exponentially) discounting information down the hierarchy, thus more severely penalising mistakes where the predicted class is further away from the ground truth. Similarly, decreasing $\beta$ in the soft-label method amounts to progressively shifting the label mass away from the ground truth and towards the neighbouring classes. Both methods reduce to the cross-entropy in the respective limits $\alpha \to 0$ and $\beta \to \infty$. Moreover, notice that varying $\beta$ affects the entropy of the distribution representing a soft label, where the two limit cases are $\beta = \infty$ for the standard one-hot case and $\beta = 0$ for the uniform distribution. We experiment with a range of values of $\alpha$ and $\beta$.

To limit noise in the evaluation procedure, for both of our methods and all of the competitors, we fit a 4th-degree polynomial to the validation loss (after having discarded the first training steps) and pick the epoch corresponding to its minimum along with its four neighbours. Then, to produce the points reported in our plots, we average the results obtained from these five epochs on the validation set, while reserving the test set for the experiments of Table 1. Notice how, in Fig. 4, when considering the hierarchical distance with k = 1, methods are almost perfectly aligned along the plot diagonal, which demonstrates the strong linear correlation between this metric and the top-1 error. This result is consistent with what is observed in [russakovsky2015imagenet], which, in , led the organisers of the ILSVRC workshop to discard rankings based on hierarchical distance.
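The checkpoint-selection step described above can be sketched as follows. This is an illustration under stated assumptions: "four neighbours" is read as two epochs on either side of the minimum, the burn-in is expressed in epochs and left as a parameter, and the helper names are hypothetical. The fit uses the normal equations in pure Python so that the example has no dependencies.

```python
def fit_polynomial(xs, ys, degree=4):
    """Least-squares polynomial fit via the normal equations."""
    n = degree + 1
    # Normal equations: (A^T A) c = A^T y with A[i][j] = xs[i] ** j.
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    aug = [row + [rhs] for row, rhs in zip(ata, aty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (aug[i][n] - tail) / aug[i][i]
    return coeffs  # coeffs[j] multiplies x ** j

def select_epochs(val_losses, burn_in=0, window=5):
    """Epochs around the minimum of the polynomial fitted to the loss."""
    epochs = list(range(burn_in, len(val_losses)))
    lo, hi = epochs[0], epochs[-1]
    # Rescale epochs to [-1, 1] to keep the normal equations well conditioned.
    xs = [2.0 * (e - lo) / (hi - lo) - 1.0 for e in epochs]
    coeffs = fit_polynomial(xs, [val_losses[e] for e in epochs])
    fitted = [sum(c * x ** j for j, c in enumerate(coeffs)) for x in xs]
    best = epochs[min(range(len(fitted)), key=fitted.__getitem__)]
    # Centre the window on the minimum, clamped to the available epochs.
    start = min(max(best - window // 2, lo), hi - window + 1)
    return list(range(start, start + window))

# Synthetic validation curve: a bowl centred on epoch 12 plus small jitter.
val_losses = [(e - 12) ** 2 / 100 + 1.0 + 0.03 * (-1) ** e for e in range(30)]
selected = select_epochs(val_losses)
print(selected)
```

Fitting a smooth curve first means that the selected window depends on the overall trend of the validation loss rather than on a single noisy minimum.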

When considering the other metrics described in Sec. 4.2, a different picture emerges. In fact, a tradeoff between top-1 error and hierarchical distance is evident in Fig. 3 and in the plots of Fig. 4 with k = 5 and k = 20. Notice how the points on the plots belonging to our methods outline a set of tradeoffs that subsumes the prior art. For example, in Fig. 3, given any desired tradeoff between top-1 error and hierarchical distance of mistakes on tieredImageNet-H, it is better to use HXE than any other method. A similar phenomenon is observable when considering the average hierarchical distance of top-5 and top-20 (Fig. 4), although in these cases it is better to use the soft labels. The only exception to this trend is represented by Barz & Denzler [barz2019hierarchy] on tieredImageNet-H, which can achieve slightly lower average hierarchical distance for k = 5 or k = 20 at a significant cost in terms of top-1 error.

Table 1 columns: Hier. dist. mistake | Avg. hier. dist. @1 | Avg. hier. dist. @5 | Avg. hier. dist. @20 | Top-1 error
Table 1 rows, tieredImageNet-H: Barz & Denzler [barz2019hierarchy]; YOLO-v2 [redmon2017yolo9000]; DeViSE [frome2013devise]; HXE (ours, two settings); Soft-labels (ours, two settings)
Table 1 rows, iNaturalist19-H: Barz & Denzler [barz2019hierarchy]; YOLO-v2 [redmon2017yolo9000]; HXE (ours, two settings); Soft-labels (ours, two settings)
Table 1: Results on the test sets of tieredImageNet-H (top) and iNaturalist19-H (bottom), with confidence intervals. For each column of each dataset, the best entry is highlighted in yellow, while the worst is highlighted in gray.

Using the results illustrated in Fig. 3 and 4, we pick two reasonable operating points for both of our proposals: one for the high-distance/low-top1-error regime, and one for the low-distance/high-top1-error regime. We then run both of these configurations on the test sets and report our results in Table 1. The means are again obtained from the five best epochs, and we use the standard deviation to compute 95% confidence intervals.
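As an illustration of how a mean and a 95% confidence interval might be derived from five per-epoch results, here is a minimal sketch. The exact interval construction is not specified in the text; the normal approximation (1.96 standard errors), the helper name and the toy values are assumptions.

```python
import math
import statistics

def mean_and_ci95(values):
    """Mean and half-width of a normal-approximation 95% interval."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    half_width = 1.96 * std / math.sqrt(len(values))
    return mean, half_width

# Hypothetical metric values from five selected epochs.
toy_values = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, half_width = mean_and_ci95(toy_values)
print(f"{mean:.2f} +/- {half_width:.2f}")
```

With only five samples, a Student-t multiplier would give a wider interval than the 1.96 used here, so the choice of multiplier matters when comparing entries.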

The trends observed on the validation set largely repeat themselves on the test set. When one desires to prioritise top-1 error, then soft labels with high β or HXE with low α are more appropriate, as they outperform the cross-entropy on the hierarchical-distance-based metrics while being practically equivalent in terms of top-1 error. In cases where the hierarchical measures should be prioritised instead, it is preferable to use soft labels with low β or HXE with high α, depending on the particular choice of hierarchical metric. Although the method of Barz & Denzler is competitive in this regime, it also exhibits the worst deterioration in top-1 error with respect to the cross-entropy.

Our experiments generally indicate, over all tested methods, an inherent tension between performance in the top-1 sense and in the hierarchical sense. We speculate that there may be a connection between this tension and observations proceeding from the study of adversarial examples indicating a tradeoff between robustness and (conventional) accuracy, as in, e.g., [tsipras2018robustness, zhang2019theoretically].

Can hierarchies be arbitrary? Although the lexical WordNet hierarchy and the biological taxonomy of iNaturalist are not visual hierarchies per se, they arguably reflect meaningful visual relationships between the objects represented in the underlying datasets. Since deep networks leverage visual features, it is interesting to investigate the extent to which the structure of a particular hierarchy is important. In other words, what would happen with an arbitrary hierarchy, one that does not have any relationship with the visual world?

To answer this question, we randomised the nodes of the hierarchies and repeated our experiments. Results on iNaturalist19-H are displayed in Fig. 5 (tieredImageNet-H exhibits a similar trend). Again, we report tradeoff plots showing top-1 errors on the x-axis and metrics based on the height of the LCA (on the randomised hierarchy) on the y-axis. It is evident that the hierarchical distance metrics are significantly worse when using the random hierarchy. Although this is not surprising, the extent to which the results deteriorate is remarkable. This suggests that the inherent nature of the structural relationship expressed by a hierarchy is paramount for learning classifiers that, besides achieving competitive top-1 accuracy, are also able to make better mistakes.
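The metric and the randomisation can be illustrated on a toy tree. The sketch below computes the height of the lowest common ancestor (LCA) of two classes and builds a random relabelling of the leaves. The tree, the helper names, and the reading of "randomised the nodes" as a permutation of the leaf classes are assumptions made for illustration.

```python
import random

# Toy tree (hypothetical): child -> parent, root has parent None.
PARENT = {
    "root": None,
    "animal": "root", "vehicle": "root",
    "dog": "animal", "cat": "animal",
    "car": "vehicle", "bus": "vehicle",
}
LEAVES = ["dog", "cat", "car", "bus"]

def ancestors(node, parent):
    """Path from node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def node_heights(parent):
    """Height of each node: longest downward path to a leaf, in edges."""
    height = {n: 0 for n in parent}
    for node in parent:
        for steps_up, anc in enumerate(ancestors(node, parent)):
            height[anc] = max(height[anc], steps_up)
    return height

def lca_height(a, b, parent, height):
    """Height of the lowest common ancestor of classes a and b."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return height[node]
    raise ValueError("nodes do not share a root")

def randomise_leaves(leaves, seed=0):
    """Random relabelling: class c takes the tree position of mapping[c]."""
    shuffled = leaves[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(leaves, shuffled))

HEIGHT = node_heights(PARENT)
mapping = randomise_leaves(LEAVES)
# Distance between two classes under the randomised hierarchy.
print(lca_height(mapping["dog"], mapping["cat"], PARENT, HEIGHT))
```

Under the relabelling, visually similar classes such as the two animals may end up in different subtrees, so the distances no longer reflect visual similarity.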

Curiously, for the soft labels, the top-1 error of the random hierarchy is consistently lower than its “real” hierarchy counterpart. We speculate this might be due to the structural constraints imposed by a hierarchy anchored to the visual world, which can limit a neural network from opportunistically learning correlations that allow it to achieve low top-1 error (at the expense of ever more brittle generalisation). Indeed, the authors of [zhang2016understanding] noted that it is more difficult to train a deep network to map real images to random labels than it is to do so with random images. The most likely explanation for this is that common visual features, which are inescapably shared by closely related examples, dictate common responses.

Figure 5: Top-1 error vs. hierarchical distance of mistakes (top) and hierarchical distance of top-20 (bottom) for iNaturalist19-H. Points closer to the bottom-left corner of the plots are the ones achieving the best tradeoff.

5 Conclusion

Since the advent of deep learning, the computer vision community’s interest in making better classification mistakes seems to have nearly vanished. In this paper, we have shown that this problem is still very much open and ripe for a comeback. We have demonstrated that two simple baselines that modify the cross-entropy loss are able to outperform the few modern methods tackling this problem. Improvements in this task are undoubtedly possible, but it is important to note the delicate balance between standard top-1 accuracy and mistake severity. Our hope is that the results presented in this paper are soon to be surpassed by the new competitors that it has inspired.


Appendix A Pruning the WordNet hierarchy

The ImageNet dataset [russakovsky2015imagenet] was generated by populating the WordNet [miller1998wordnet] hierarchy of nouns with images. WordNet is structured as a graph composed of a set of IS-A parent-child relationships. Similarly to the work of Morin & Bengio [morin2005hierarchical] and Redmon & Farhadi [redmon2017yolo9000], our proposed hierarchical cross entropy loss (HXE, Sec. 3.1) also relies on the assumption that the hierarchy underpinning the data takes the form of a tree. Therefore, we modified the hierarchy to obtain a tree from the WordNet graph.

First, for each class, we found the longest path from the corresponding node to the root. This amounts to selecting the paths with the highest discriminative power with respect to the image classes. When multiple such paths existed, we selected the one with the minimum number of new nodes and added it to the new hierarchy. Second, we removed the few non-leaf nodes with a single child, as they do not possess any discriminative power.
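A simplified variant of these two pruning steps is sketched below on a toy IS-A graph. Every node keeps the parent lying on its longest path to the root, which gives each class its longest root path and guarantees a tree; ties are broken by list order rather than by the fewest-new-nodes rule described above. The graph, node names and helper names are hypothetical, and the root is kept even if it has a single child.

```python
from functools import lru_cache

# Toy IS-A graph (hypothetical): node -> list of parents, forming a DAG.
PARENTS = {
    "entity": [],
    "object": ["entity"],
    "artifact": ["object"],
    "sign": ["entity"],
    "signboard": ["artifact"],
    "traffic sign": ["sign", "signboard"],
    "vehicle": ["artifact"],
    "car": ["vehicle"],
}
CLASSES = ["traffic sign", "car"]

@lru_cache(maxsize=None)
def depth(node):
    """Length of the longest path from node up to the root."""
    parents = PARENTS[node]
    return 0 if not parents else 1 + max(depth(p) for p in parents)

def longest_path_tree(classes):
    """child -> parent map keeping, for every visited node, its deepest parent."""
    tree = {}
    stack = list(classes)
    while stack:
        node = stack.pop()
        if node in tree:
            continue
        parents = PARENTS[node]
        tree[node] = max(parents, key=depth) if parents else None
        if tree[node] is not None:
            stack.append(tree[node])
    return tree

def collapse_single_children(tree, classes):
    """Drop non-leaf, non-root nodes that have exactly one child."""
    tree = dict(tree)
    changed = True
    while changed:
        changed = False
        children = {}
        for child, parent in tree.items():
            children.setdefault(parent, []).append(child)
        for node, kids in children.items():
            if node is None or tree[node] is None or node in classes:
                continue
            if len(kids) == 1:
                tree[kids[0]] = tree[node]  # reattach the child one level up
                del tree[node]
                changed = True
                break
    return tree

pruned = collapse_single_children(longest_path_tree(CLASSES), CLASSES)
print(pruned)
```

In this toy graph the class "traffic sign" reaches the root through "signboard" rather than "sign", because that path is longer, and the single-child chain between "artifact" and the classes is then collapsed.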

Finally, we observed that the pruned hierarchy’s root is not physical entity, as one would expect, but rather the more general entity. This is problematic, since entity contains both physical objects and abstract concepts, while tieredImageNet-H classes only represent physical objects. Upon inspection, we found that this was caused by the classes bubble, traffic sign, and traffic lights being connected to sphere and sign, which are considered abstract concepts in the WordNet hierarchy. Instead, we connected them to sphere, artifact and signboard, respectively, thus connecting them to physical entity.

Even though our second proposed method (soft labels), as well as the cross-entropy baseline, DeViSE [frome2013devise] and Barz & Denzler [barz2019hierarchy], do not make any assumption regarding the structure of the hierarchy, we ran them using this obtained pruned hierarchy for consistency of the experimental setup.

Appendix B More implementation details

In order to perform meaningful comparisons, we adopted a simple configuration (network architecture, optimiser, data augmentation, …) and used it for all the methods presented in this paper. This configuration is already stated in the implementation details of Sec. 4.3, but we report it again here for convenience.

We used a ResNet-18 architecture (with weights pretrained on ImageNet) trained with Adam [reddi2019convergence] for steps and mini-batch size of . We used a learning rate of unless specified otherwise. To prevent overfitting, we adopted PyTorch’s basic data augmentation routines with default hyperparameters: RandomHorizontalFlip() and RandomResizedCrop(). For both datasets, images have been resized to .

Below, we provide further information about the methods we compared against, together with the few minor implementation choices we had to make. As mentioned in Sec. 1, these methods represent, to the best of our knowledge, the only modern attempts to deliberately reduce the semantic severity of a classifier’s mistakes that are generally applicable to any modern architecture.

YOLO-v2. In motivating the hierarchical variant of the YOLO-v2 framework, Redmon & Farhadi [redmon2017yolo9000, Sec. 4] mention the need to integrate the smaller COCO detection dataset [lin2014microsoft] with the larger ImageNet classification dataset under a unified class hierarchy. Their approach too relies on a heuristic for converting the WordNet graph into a tree, and then effectively trains a conditional classifier at every parent node in the tree by using one softmax layer per sibling group and training under the usual softmax loss over leaf posteriors. The authors report only a marginal drop in standard classification accuracy when enforcing this tree-structured prediction, including the additional internal-node concepts. They note that the approach brings benefits, including graceful degradation on new or unknown object categories, as the network is still capable of high confidence in a parent class when unsure as to which of its children is correct.

Since the model outputs conditional probabilities instead of class probabilities, we changed the output dimension of the terminal fully-connected layer, such that it outputs logits for every node in the hierarchy. Proper normalisation of the conditional probabilities is then enforced at every node of the hierarchy using the softmax function. Finally, the loss is computed by summing the individual cross-entropies of the conditional probabilities on the path connecting the ground-truth label to the root of the tree.
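The loss just described can be sketched in a few lines. The example below uses one logit per non-root node, applies a softmax within each sibling group, and sums the cross-entropies of the conditionals on the path from the ground-truth leaf to the root. The toy tree and all names are hypothetical, and plain Python is used in place of tensors for clarity.

```python
import math

# Toy tree (hypothetical): child -> parent; "root" itself has no logit.
PARENT = {"animal": "root", "vehicle": "root",
          "dog": "animal", "cat": "animal",
          "car": "vehicle", "bus": "vehicle"}

def sibling_groups(parent):
    """Group nodes that share a parent; each group gets its own softmax."""
    groups = {}
    for child, par in parent.items():
        groups.setdefault(par, []).append(child)
    return groups

def conditional_probs(logits, parent):
    """p(node | parent(node)) from one logit per non-root node."""
    probs = {}
    for siblings in sibling_groups(parent).values():
        peak = max(logits[s] for s in siblings)  # for numerical stability
        exps = {s: math.exp(logits[s] - peak) for s in siblings}
        total = sum(exps.values())
        for s in siblings:
            probs[s] = exps[s] / total
    return probs

def path_loss(logits, label, parent):
    """Sum of cross-entropies of the conditionals from label up to the root."""
    probs = conditional_probs(logits, parent)
    loss, node = 0.0, label
    while node in parent:  # stops once the root is reached
        loss -= math.log(probs[node])
        node = parent[node]
    return loss

zero_logits = {node: 0.0 for node in PARENT}
print(path_loss(zero_logits, "dog", PARENT))
```

With all logits equal, each of the two binary sibling groups on the path contributes log 2, so the loss for any leaf of this toy tree is 2 log 2.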


model.fc = torch.nn.Sequential(
Listing 1: Network head used for DeViSE.

DeViSE. Frome et al. [frome2013devise] proposed DeViSE with the aim of both making more semantically reasonable errors and enabling zero-shot prediction. The approach involves modifying a standard deep classification network to instead output vectors representing semantic embeddings of the class labels. The label embeddings are learned through analysis of unannotated text [mikolov2013efficient] in a separate step, with the classification network modified by replacing the softmax layer with a learned linear mapping to that embedding space. The loss function is a form of ranking loss which penalises the extent of greater cosine similarity to negative examples than positive ones. Inference comprises finding the nearest class embedding vectors to the output vector, again under cosine similarity.
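A minimal sketch of such a ranking loss and of nearest-embedding inference is given below. It assumes a hinge formulation with a fixed margin; the margin value, the three-dimensional toy embeddings and the function names are illustrative assumptions rather than details taken from this paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranking_loss(output, embeddings, label, margin=0.1):
    """Hinge loss: each negative should trail the positive by `margin`."""
    positive = cosine(output, embeddings[label])
    return sum(max(0.0, margin - positive + cosine(output, emb))
               for name, emb in embeddings.items() if name != label)

def predict(output, embeddings, k=1):
    """Top-k class names by cosine similarity to the network output."""
    ranked = sorted(embeddings,
                    key=lambda name: cosine(output, embeddings[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical class embeddings (real ones would come from a text model).
EMBEDDINGS = {"dog": [1.0, 0.0, 0.0],
              "cat": [0.0, 1.0, 0.0],
              "car": [0.0, 0.0, 1.0]}

print(ranking_loss([1.0, 1.0, 0.0], EMBEDDINGS, "dog"))
print(predict([1.0, 0.2, 0.0], EMBEDDINGS, k=2))
```

An output aligned with the correct embedding incurs no loss, whereas an output equally close to "dog" and "cat" is penalised by exactly the margin.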

Since an official implementation of DeViSE is not available to the public, we re-implemented it following the details discussed in the paper [frome2013devise]. Below is the list of changes we found appropriate to make.

  • For the generation of the word embeddings, instead of the rather dated method of Mikolov et al. [mikolov2013efficient], we used the high-performing and publicly available fastText library [bojanowski2017enriching] to obtain word embeddings of length 300 (the maximum made available by the library).

  • Instead of a single fully-connected layer mapping the network output to the word embeddings, we used the network “head” described in Listing 1. We empirically verified that this configuration with two fully-connected layers outperforms the one with a single fully-connected layer. Moreover, in this way the number of parameters of DeViSE roughly matches the one of the other experiments, which have architectures with a single fully-connected layer but a higher number of outputs (608, equivalent to the number of classes of tieredImageNet-H, as opposed to 300, the word-embedding size).

  • Following what is described in [frome2013devise], we performed training in two steps. First, we trained only the fully-connected layers for the first steps with a learning rate of . We then trained the entire network for extra epochs, using a learning rate of for the weights of the backbone. Note that [frome2013devise] specified neither how long the two steps of training should last nor the values of the respective learning rates. To decide the above values, we performed a small hyperparameter search.

  • Frome et al. [frome2013devise] state that DeViSE is trained starting from an ImageNet-pretrained architecture. Since we evaluated all methods on tieredImageNet-H, we instead initialised DeViSE weights with the ones of an architecture fully trained with the cross-entropy loss on this dataset. We verified that this obtains better results than starting training from ImageNet weights.

Barz&Denzler [barz2019hierarchy]. This approach involves first mapping class labels into a space in which dot products represent semantic similarity (based on normalised LCA height), then training a deep network to learn matching feature vectors (before the fully connected layer) on its inputs. There is a very close relationship to DeViSE [frome2013devise], with the main difference being that here, the label embedding is derived from a supplied class hierarchy in a straightforward manner instead of via text analysis: iterative arrangement of embedding vectors such that all dot products equal respective semantic similarities. The authors experiment with two different loss functions: (1) a linear reward for the dot product between the output feature vector and ground-truth class embedding (i.e. a penalty on misalignment); and (2) the sum of the preceding and a weighted term of the usual cross-entropy loss on the output of an additional fully connected layer with softmax activation, for classification. We only used (2), since in [barz2019hierarchy] it attains significantly better results than (1).
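The two loss variants can be sketched as follows. This is an illustration under assumptions: the feature vector is L2-normalised before the dot product, the misalignment penalty is written as one minus the dot product, and the weight of the cross-entropy term (`weight`) is a placeholder value; none of these specifics are stated in the description above.

```python
import math

def correlation_loss(feature, class_embedding):
    """Variant (1): 1 - <f, phi(y)> for an L2-normalised feature f."""
    norm = math.sqrt(sum(x * x for x in feature))
    return 1.0 - sum((x / norm) * e for x, e in zip(feature, class_embedding))

def cross_entropy(logits, label_index):
    """Softmax cross-entropy computed with the log-sum-exp trick."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label_index]

def combined_loss(feature, class_embedding, logits, label_index, weight=0.1):
    """Variant (2): misalignment penalty plus weighted cross-entropy."""
    return (correlation_loss(feature, class_embedding)
            + weight * cross_entropy(logits, label_index))

# A feature aligned with a unit-norm class embedding has zero penalty.
print(correlation_loss([2.0, 0.0], [1.0, 0.0]))
print(combined_loss([2.0, 0.0], [1.0, 0.0], [0.0, 0.0], 0))
```

In variant (2) the classification logits come from an additional layer, so the embedding term shapes the features while the cross-entropy term sharpens the final decision.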

We used the code released by the authors to produce the label embeddings. To ensure consistency with the other experiments, two differences in implementation with respect to the original paper were required.

  • We simply used a ResNet-18 instead of the architectures Barz & Denzler experimented with in their paper [barz2019hierarchy] (i.e. ResNet-110w [he2016deep], PyramidNet-272-200 [han2017deep] and Plain-11 [barz2018deep]).

  • Instead of SGD with warm restarts [loshchilov2016sgdr], we used Adam [reddi2019convergence] with a learning rate of (the value performing best on the validation set) for steps.

Appendix C Outputting conditional probabilities with HXE

We also investigated whether outputting conditional probabilities instead of class probabilities affects the performance of the classifier represented by our proposed HXE approach (Sec. 3.1). These two options correspond, respectively, to implementing hierarchical information as an architectural change or as modification of the loss only.

Comparing different values of α for otherwise identical training parameters, we observe that outputting the class probabilities consistently results in an improvement of performance across all of our metrics, see Suppl. Fig. D. Moreover, directly considering the class probabilities also has the advantage of not requiring direct knowledge of the hierarchy at test time.
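To illustrate the class-probability variant, the sketch below derives the conditional probabilities along the ground-truth path from ordinary leaf probabilities and combines them into a weighted loss. The weighting exp(-alpha * depth), with depth counted in edges from the root, is an assumed reading of "discounting information down the hierarchy"; the toy tree and all names are hypothetical.

```python
import math

# Toy tree (hypothetical): child -> parent, root has parent None.
PARENT = {"root": None, "animal": "root", "vehicle": "root",
          "dog": "animal", "cat": "animal",
          "car": "vehicle", "bus": "vehicle"}

def depth(node):
    """Number of edges between node and the root."""
    edges = 0
    while PARENT[node] is not None:
        node = PARENT[node]
        edges += 1
    return edges

def node_probabilities(leaf_probs):
    """Probability of every node: sum of the probabilities of its leaves."""
    mass = {node: 0.0 for node in PARENT}
    for leaf, p in leaf_probs.items():
        node = leaf
        while node is not None:
            mass[node] += p
            node = PARENT[node]
    return mass

def hxe_loss(leaf_probs, label, alpha):
    """Weighted sum of -log p(node | parent) along the path to the root."""
    mass = node_probabilities(leaf_probs)
    loss, node = 0.0, label
    while PARENT[node] is not None:
        conditional = mass[node] / mass[PARENT[node]]
        loss -= math.exp(-alpha * depth(node)) * math.log(conditional)
        node = PARENT[node]
    return loss

# Hypothetical softmax output over the four leaf classes.
leaf_probs = {"dog": 0.5, "cat": 0.2, "car": 0.2, "bus": 0.1}
print(hxe_loss(leaf_probs, "dog", alpha=0.0))
print(hxe_loss(leaf_probs, "dog", alpha=0.5))
```

With `alpha = 0` the conditionals telescope and the loss equals the ordinary cross-entropy of the leaf probability. Because the hierarchy enters only through the loss, the trained network can be used at test time like any flat classifier.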

Appendix D Supplementary figures

Supplementary Figure 1: Distribution of mistake severity when picking random example pairs in tieredImageNet-H. Note that even though this distribution shares some similarities with the ones shown in Fig. 1, it is substantially different. This indicates that the general shape of the distributions of the mistake severities for the various DNN architectures investigated here cannot be explained by properties of the dataset alone.