1 Introduction
Image classification networks have improved greatly over recent years, but generalisation remains imperfect, and test-time errors do of course occur. Conventionally, such errors are defined with respect to a single ground-truth class and reported using one or more top-k measures (k typically set to 1 or 5
). However, this practice imposes certain notions of what it means to make a mistake, including treating all classes other than the “true” label as equally wrong. This may not actually correspond to our intuitions about desired classifier behaviour, and for some applications this point may prove crucial. Take the example of an autonomous vehicle observing an object on the side of the road: whatever measure of classifier performance we use, we can certainly agree that mistaking a lamppost for a tree is less of a problem than mistaking a person for a tree, as the latter mistake would have serious implications for both prediction and planning. If we want to take such considerations into account, we must incorporate a non-trivial model of the relationships between classes, and accordingly rethink more broadly what it means for a network to “make a mistake”. One natural and convenient way of representing these class relationships is through a taxonomic hierarchy tree.
This idea is not new. In fact, it was once fairly common across various machine learning application domains to consider class hierarchy when designing classifiers, as surveyed in Silla & Freitas
[silla2011survey]. That work assembled and categorised a large collection of hierarchical classification problems and algorithms, and suggested widely applicable measures for quantifying classifier performance in the context of a given class hierarchy. The authors noted that the hierarchy-informed classifiers of the era typically outperformed “flat” (i.e. hierarchy-agnostic) classifiers empirically, even under standard metrics, with the performance gap increasing further under the suggested hierarchical metrics. Furthermore, class hierarchy is at the core of the ImageNet dataset: as detailed in Deng et al. [deng2009imagenet], it was constructed directly from WordNet [miller1998wordnet], itself a hierarchy originally designed solely to represent semantic relationships between words. Shortly after ImageNet’s introduction, works such as Deng et al. [deng2010does], Zhao et al. [zhao2011large], and Verma et al. [verma2012learning] explicitly noted that the underpinning WordNet hierarchy suggested a way of quantifying the severity of mistakes, and experimented with minimising hierarchical costs accordingly. Likewise, Deng et al. [deng2011hierarchical] presented a straightforward method for using a hierarchy-derived similarity matrix to define a more semantically meaningful compatibility function for image retrieval. Despite this initial surge of interest and the promising results accompanying it, the community shortly decided that hierarchical measures were not communicating substantially different information about classifier performance than top-k measures¹, and effectively discarded them. When the celebrated results of Krizhevsky et al. [krizhevsky2012imagenet] were reported in flat top-k terms only, the precedent was firmly set for the work which followed in the deep learning era of image classification. Interest in optimising hierarchical performance measures waned accordingly.

¹From Russakovsky et al. [russakovsky2015imagenet]: “[..] we found that all three measures of error (top-5, top-1, and hierarchical) produced the same ordering of results. Thus, since ILSVRC2012 we have been exclusively using the top-5 metric which is the simplest and most suitable to the dataset.”

We argue here that this problem is ripe for revisitation, and we begin by pointing to Fig. 1. Here, a mistake is defined as a top-1 prediction which differs from the ground-truth class, and the severity of such a mistake is the height of the lowest common ancestor of the predicted and ground-truth classes in the hierarchy. We see that while the flat top-1 accuracies of state-of-the-art classifiers have improved to impressive levels over the years, the distributions of the severities of the errors that are made have changed very little over this time. We hypothesise that this is due, at least in part, to the scarcity of modern learning methods which attempt to exploit prior information and preferences about class relationships in the interest of “making better mistakes”, whether this information is sourced from an offline taxonomy or otherwise. The few exceptions of which we are aware include Frome et al. [frome2013devise], Wu et al. [wu2016learning], Barz & Denzler [barz2019hierarchy], and a passing mention in Redmon & Farhadi [redmon2017yolo9000]. In Sec. 2, we suggest a framework for thinking about these pieces of work, their predecessors, and some of their conceptual relatives.
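The severity measure just described can be sketched in a few lines. The toy hierarchy below (lamppost/tree/person example) is purely illustrative and assumes, for simplicity, that all leaves sit at the same depth:

```python
# Mistake severity as the height of the lowest common ancestor (LCA) of the
# predicted and ground-truth classes. Toy hierarchy assumed for illustration.

# parent[node] -> its parent; the root ("entity") has no entry.
parent = {
    "person": "organism", "tree": "organism",
    "lamppost": "artifact",
    "organism": "entity", "artifact": "entity",
}

def ancestors(node):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def lca_height(a, b):
    """Height of the LCA, counted in steps up from the leaf level.
    (Assumes a and b are leaves at equal depth.)"""
    up_a = ancestors(a)
    up_b = set(ancestors(b))
    for height, node in enumerate(up_a):
        if node in up_b:
            return height
    raise ValueError("nodes share no ancestor")

print(lca_height("tree", "tree"))        # correct prediction -> 0
print(lca_height("person", "tree"))      # siblings under "organism" -> 1
print(lca_height("person", "lamppost"))  # only share the root -> 2
```

Under this measure, confusing a person with a tree (severity 2 in the toy tree once "person" is moved under "organism" only) scores worse than confusing two close siblings, which is exactly the intuition Fig. 1 builds on.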
The contributions of this work are as follows:

We review the relevant literature within an explanatory framework which unifies fairly disjoint prior art.

Building on the perspective gained from the preceding, we propose two methods that are both simple and effective at leveraging class hierarchies.

We perform an extensive experimental evaluation, both to demonstrate the effectiveness of these methods compared to the prior art and to encourage future work on the topic.
To ensure reproducibility, the PyTorch
[paszke2019pytorch] code of all our experiments will be made available at github.com/fiveai/making-better-mistakes.

2 Framework and related work
We first suggest a simple framework for thinking about methods relevant to the problem of making better mistakes on image classification, beginning with the standard supervised setup. Consider a training set S = {(x_i, C_i)}_{i=1}^{N} which pairs images x_i with class labels C_i. A network architecture implements the predictor function φ(x; θ), whose parameters θ are learned by minimising the empirical risk
(1)  \frac{1}{N}\sum_{i=1}^{N} L\big(\phi(x_i;\theta),\, y(C_i)\big) + R(\theta),
where the loss function L compares the predictor's output φ(x_i; θ) to an embedded representation y(C_i) of each example's class, and R is a regulariser. Under common choices such as the cross-entropy for L and the one-hot embedding for y, it is easy to see that the framework is agnostic of relationships between classes. The question is how the relationships encoded in a class hierarchy H can be incorporated into the loss in Eqn. 1. We identify the following three approaches:

Replacing the class representation y(C) with an alternate embedding y^H(C) informed by a hierarchy H. Such “label-embedding” methods, discussed in Sec. 2.1, can draw their embeddings both from taxonomic hierarchies and from alternative sources.

Altering the loss function in terms of its arguments to produce L_H(φ(x), y(C)), i.e. making the penalty assigned to a given output distribution φ(x) and embedded label y(C) dependent on the hierarchy H. Methods using these “hierarchical losses” are covered in Sec. 2.2.

Altering the predictor function φ to a hierarchy-dependent φ_H, i.e. making hierarchically informed architectural changes to the network, generally in the hope of introducing a favourable inductive bias. We cover these “hierarchical architectures” in Sec. 2.3.
While a hierarchical regulariser R_H is certainly feasible, it is curiously rare in practice: [zhao2011large] is the only example we know of.
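The hierarchy-agnosticism of the standard setup can be seen in a minimal example: with a one-hot label, the cross-entropy depends only on the probability assigned to the true class, so it cannot distinguish a benign confusion from a severe one. The class names and toy distributions below are assumptions for illustration:

```python
import math

# Cross-entropy with a one-hot label reduces to -log p_true: it is blind to
# *where* the remaining probability mass goes.
classes = ["tree", "lamppost", "person"]

def cross_entropy_onehot(p, true_idx):
    """One-hot cross-entropy: only the true class's probability matters."""
    return -math.log(p[true_idx])

# Two predictors place the same mass on the true class "tree" (index 0),
# but the second confuses a person with a tree -- arguably a worse mistake.
p_confuses_lamppost = [0.4, 0.6, 0.0]   # residual mass on "lamppost"
p_confuses_person   = [0.4, 0.0, 0.6]   # residual mass on "person"

assert cross_entropy_onehot(p_confuses_lamppost, 0) == cross_entropy_onehot(p_confuses_person, 0)
```

Both predictors incur exactly the same loss, which is precisely the behaviour the three approaches above set out to change.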
2.1 Label-embedding methods
These methods map class labels to vectors whose relative locations represent semantic relationships, and optimise a loss on these embedded vectors. The DeViSE method of Frome et al. [frome2013devise] maps target classes onto a unit hypersphere via the method of Mikolov et al. [mikolov2013efficient], assigning terms with similar contexts to similar representations through analysis of unannotated Wikipedia text. The loss function is a ranking loss which penalises the extent to which the output is more cosine-similar to false label embeddings than to the correct one. They learn a linear mapping from a pretrained visual feature pipeline (based on AlexNet) to the embedded labels, then fine-tune the visual pipeline itself through backpropagation. Romera-Paredes & Torr [romera2015embarrassingly] note that their one-line solution for learning an analogous linear mapping for zero-shot classification should easily extend to accommodate these sorts of embeddings. In Hinton et al. [hinton2015distilling], the role of the label-embedding function is played by a temperature-scaled pre-existing classifier ensemble. This ensemble is “distilled” into a smaller DNN through cross-entropy minimisation against the ensemble's output². For zero-shot classification, Xian et al. [xian2016latent] experiment with various independent embedding methods, as is also done in Akata et al. [akata2015evaluation]: annotated attributes, word2vec [mikolov2013distributed], GloVe [pennington2014glove], and the WordNet hierarchy. Their ranking loss function is functionally equivalent to that of Frome et al. [frome2013devise], and they learn a choice of linear mappings to these representations from the visual features output by a fixed CNN. Barz & Denzler [barz2019hierarchy] present an embedding algorithm which iteratively maps examples onto a hypersphere such that all cosine distances represent similarities derived from lowest common ancestor (LCA) height in a given hierarchy tree. They train a full deep model, and alternately present two different losses: (1) a linear loss based on cosine distance to the embedded class vectors, used for retrieval, and (2) the standard cross-entropy loss on the output of a fully connected/softmax layer added after the embedding layer, for classification.

²This refers to the “distillation” section of [hinton2015distilling]. There is a separate hierarchical classification section discussed later.
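The hinge ranking loss shared by several of these methods can be sketched as follows. The random toy embeddings and the margin value are assumptions; real systems use learned word or hierarchy embeddings and a tuned margin:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: unit-norm label embeddings (standing in for word2vec-
# or LCA-derived vectors) and a visual output mapped into the same space.
num_classes, dim = 5, 8
label_emb = rng.normal(size=(num_classes, dim))
label_emb /= np.linalg.norm(label_emb, axis=1, keepdims=True)

def ranking_loss(v, true_idx, margin=0.1):
    """DeViSE-style hinge: penalise wrong labels whose cosine similarity to
    the output v comes within `margin` of the true label's similarity."""
    v = v / np.linalg.norm(v)
    sims = label_emb @ v                   # cosine similarity to each label
    gaps = margin - sims[true_idx] + sims  # hinge argument per rival class
    gaps[true_idx] = 0.0                   # the true label is not a rival
    return np.maximum(0.0, gaps).sum()

loss = ranking_loss(rng.normal(size=dim), true_idx=2)
print(loss >= 0.0)
```

When the output coincides with the true label's embedding, every rival term falls below the hinge and the loss vanishes, which is the behaviour the ranking formulation is designed to reward.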
2.2 Hierarchical losses
In these methods, the loss function itself is parametrised by the class hierarchy such that a higher penalty is assigned to the prediction of a more distant relative of the true label. Deng et al. [deng2010does] simply train kNN- and SVM-based classifiers to minimise the expected WordNet LCA height directly. Zhao et al. [zhao2011large] modify standard multiclass logistic regression by recomputing output class probabilities as the normalised class-similarity-weighted sums of all class probabilities calculated as per the usual regression. They also regularise feature selection using an “overlapping-group-lasso penalty” which encourages the use of similar features for closely related classes, a rare example of a hierarchical regulariser. Verma et al. [verma2012learning] incorporate normalised LCA height into a “context-sensitive loss function” while learning a separate metric at each node in a taxonomy tree for nearest-neighbour classification. Wu et al. [wu2016learning] consider the specific problem of food classification. They begin with a standard deep network pipeline and add one fully connected/softmax layer to the end of the (shared) pipeline for each level of the hierarchy, so as to explicitly predict each example's lineage. The loss is the sum of standard log losses over all of the levels, with a single user-specified coefficient for reweighting the loss of all levels other than the bottom one. Because this method is inherently susceptible to producing marginal probabilities that are inconsistent across levels, it incorporates a subsequent label-propagation smoothing step. Alsallakh et al. [alsallakh2017convolutional] likewise use a standard deep architecture as their starting point, but instead add branches strategically to intermediate pipeline stages. They thereby force the net to classify into offline-determined superclasses at the respective levels, backpropagating the error in these intermediate predictions accordingly. On deployment, these additions are simply discarded.

2.3 Hierarchical architectures
These methods attempt to incorporate class hierarchy into the classifier architecture without necessarily changing the loss function otherwise. The core idea is to “divide and conquer” at the structural level, with the classifier assigning inputs to superclasses at earlier layers and making fine-grained distinctions at later ones. In the context of language models, it was noted at least as early as Goodman [goodman2001classes] that classification with respect to an IS-A hierarchy tree could be formulated as a tree of classifiers outputting conditional probabilities, with the product of the conditionals along a given leaf's ancestry representing its posterior; motivated by efficiency, Morin & Bengio [morin2005hierarchical] applied this observation to a binary hierarchy derived from WordNet. Redmon & Farhadi [redmon2017yolo9000] propose a modern deep-learning variant of this framework in the design of the YOLOv2 object detection and classification system. Using a version of WordNet pruned into a tree, they effectively train a conditional classifier at every parent node in the tree by using one softmax layer per sibling group and training under the usual cross-entropy loss over leaf posteriors. While their main aim is to enable the integration of the COCO detection dataset with ImageNet, they suggest that graceful degradation on new or unknown object categories might be an incidental benefit. Brust & Denzler [brust2018integrating] propose an extension of conditional classifier chains to the more general case of DAGs.
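The conditional-classifier formulation can be sketched as follows. The two-level toy tree and the logits are assumptions; in a real network each sibling group would have its own softmax head over shared features:

```python
import numpy as np

# One softmax per sibling group; a leaf's posterior is the product of the
# conditionals along its ancestry. Toy tree assumed for illustration.
children = {
    "root": ["animal", "vehicle"],
    "animal": ["cat", "dog"],
    "vehicle": ["car", "bus"],
}
parent = {c: p for p, cs in children.items() for c in cs}

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def leaf_posteriors(logits_per_group):
    """logits_per_group maps each parent node to the logits of its children
    (in practice, the output of one linear head per sibling group)."""
    cond = {}
    for node, cs in children.items():
        for c, p in zip(cs, softmax(logits_per_group[node])):
            cond[c] = p
    post = {}
    for leaf in ["cat", "dog", "car", "bus"]:
        p, node = 1.0, leaf
        while node in parent:          # multiply conditionals up to the root
            p *= cond[node]
            node = parent[node]
        post[leaf] = p
    return post

post = leaf_posteriors({"root": [0.2, -0.1], "animal": [1.0, 0.0], "vehicle": [0.0, 0.0]})
print(abs(sum(post.values()) - 1.0) < 1e-9)  # leaf posteriors sum to one
```

Because each sibling group is individually normalised, the leaf posteriors form a valid distribution by construction, and training can proceed with the usual cross-entropy over them.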
The above approaches can be seen as a limiting case of hierarchical classification, in which every split in the hierarchy is cast as a separate classification problem. Many hierarchical classifiers fall between this extreme and that of flat classification, working in terms of a coarser-grained conditionality in which a “generalist” makes hard or soft assignments to groupings of the target classes before then distinguishing the group members from one another using “experts”. Xiao et al. [xiao2014error], the quasi-ensemble section of Hinton et al. [hinton2015distilling], Yan et al. [yan2015hd], and Ahmed et al. [ahmed2016network] all represent modern variations on this theme (which first appears no later than [jacobs1991adaptive]). Additionally, the listed methods all use some form of low-level feature sharing, either via architectural constraint or parameter cloning, and all infer the visual hierarchy dynamically through confusion clustering or latent parameter inference. Alsallakh et al. [alsallakh2017convolutional] make the only proposal of which we are aware that combines hierarchical architectural modifications (at train time) with a hierarchical loss, as described in Sec. 2.2. At test time, however, the architecture is that of an unmodified AlexNet, and all superclass “assignment” is purely implicit.
3 Method
We now outline two simple methods that allow us to leverage class hierarchies in order to make better mistakes on image classification. We concentrate on the case where the network outputs a categorical distribution over classes for each input image, and denote this distribution by p(C), omitting the conditioning on the input x and the parameters θ for brevity. In Sec. 3.1, we describe the hierarchical cross-entropy (HXE), a straightforward example of the hierarchical losses reviewed in Sec. 2.2. This approach expands each class probability into the chain of conditional probabilities defined by its unique lineage in a given hierarchy tree. It then reweights the corresponding terms in the loss so as to penalise classification mistakes in a way that is informed by the hierarchy. In Sec. 3.2, we suggest an easy choice of embedding function to implement the label-embedding framework covered in Sec. 2.1. The resulting soft labels are PMFs over the set of classes whose values decay exponentially w.r.t. an LCA-based distance from the ground truth.
3.1 Hierarchical cross-entropy
When the hierarchy is a tree, it corresponds to a unique factorisation of the categorical distribution over classes in terms of the conditional probabilities along the path connecting each class to the root of the tree. Denoting the path from a leaf node C to the root R as C^{(0)}, C^{(1)}, \ldots, C^{(h)}, the probability of class C can be factorised as

(2)  p(C) = \prod_{l=0}^{h-1} p\big(C^{(l)} \mid C^{(l+1)}\big),

where C^{(0)} = C, C^{(h)} = R, and h is the length of the path from C to the root. Note that we have omitted the last term p(C^{(h)}) = p(R) = 1. Conversely, the conditionals can be written in terms of the class probabilities as

(3)  p\big(C^{(l)} \mid C^{(l+1)}\big) = \frac{\sum_{A \in \mathrm{Leaves}(C^{(l)})} p(A)}{\sum_{B \in \mathrm{Leaves}(C^{(l+1)})} p(B)},

where \mathrm{Leaves}(C) denotes the set of leaf nodes of the subtree rooted at node C.
A direct way to incorporate hierarchical information in the loss is to hierarchically factorise the output of the classifier according to Eqn. 2 and to define the total loss as the reweighted sum of the cross-entropies of the conditional probabilities. This leads us to define the hierarchical cross-entropy (HXE) as

(4)  L_{\mathrm{HXE}}(p, C) = -\sum_{l=0}^{h-1} \lambda\big(C^{(l)}\big) \log p\big(C^{(l)} \mid C^{(l+1)}\big),

where \lambda(C^{(l)}) is the weight associated with the edge from node C^{(l+1)} to node C^{(l)}; see Fig. 2a. Even though this loss is expressed in terms of conditional probabilities, it can easily be applied to models that output class probabilities using Eqn. 3. Note that L_{\mathrm{HXE}} reduces to the standard cross-entropy when all weights are equal to 1. This limit case, which was briefly mentioned by Redmon & Farhadi in their YOLOv2 paper [redmon2017yolo9000], results only in architectural changes and does not incorporate hierarchical information in the loss directly.
Eqn. 4 has an interesting information-theoretical interpretation: since each term -\log p(C^{(l)} \mid C^{(l+1)}) corresponds to the information associated with the edge between C^{(l)} and C^{(l+1)} in the hierarchy, the HXE amounts to discounting the information associated with each of these edges differently. Note that since the HXE is expressed in terms of conditional probabilities, the reweighting in Eqn. 4 is not equivalent to reweighting the cross-entropy for each possible ground-truth class independently (as done, for instance, in [lin2017focal, cui2019class]).
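A minimal sketch of the HXE of Eqn. 4, with the conditionals recovered from leaf probabilities via subtree sums as in Eqn. 3. The toy tree, class names, and probabilities are assumptions, and the per-edge weights are left as a caller-supplied function:

```python
import math

# Assumed toy tree: root -> {animal, vehicle}, animal -> {cat, dog}, etc.
children = {"root": ["animal", "vehicle"], "animal": ["cat", "dog"], "vehicle": ["car", "bus"]}
parent = {c: p for p, cs in children.items() for c in cs}
leaf_index = {"cat": 0, "dog": 1, "car": 2, "bus": 3}

def leaves(node):
    """Leaf classes of the subtree rooted at `node` (Leaves(C) in Eqn. 3)."""
    if node in leaf_index:
        return [node]
    return [l for c in children[node] for l in leaves(c)]

def subtree_mass(p, node):
    return sum(p[leaf_index[l]] for l in leaves(node))

def hxe(p, true_leaf, weight):
    """HXE of Eqn. 4. `weight(node)` supplies the lambda for the edge from
    `node` to its parent (e.g. exp(-alpha * height(node)) as in Eqn. 5)."""
    loss, node = 0.0, true_leaf
    while node in parent:
        cond = subtree_mass(p, node) / subtree_mass(p, parent[node])  # Eqn. 3
        loss -= weight(node) * math.log(cond)
        node = parent[node]
    return loss

p = [0.5, 0.2, 0.2, 0.1]  # class probabilities over (cat, dog, car, bus)
# With all weights equal to 1, the product telescopes and HXE recovers the
# standard cross-entropy -log p(true class):
flat_ce = -math.log(p[leaf_index["cat"]])
print(abs(hxe(p, "cat", lambda n: 1.0) - flat_ce) < 1e-9)
```

Setting the weights to decay with height, as in Eqn. 5 below, then shifts the loss towards getting the coarse (near-root) decisions right.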
A sensible choice for the weights is to take

(5)  \lambda(C) = \exp\big(-\alpha\, h(C)\big),

where h(C) is the height of node C and α ≥ 0 is a hyperparameter that controls the extent to which information is discounted down the hierarchy. The higher the value of α, the higher the preference for “generic” as opposed to “fine-grained” information, because classification errors related to nodes further away from the root receive a lower loss. While such a definition has the advantage of simplicity, one could think of other meaningful weightings, such as ones depending on the branching factor of the tree or encoding a preference towards specific classes. We concentrate on Eqn. 5 here, as it is both simple and easily interpretable, leaving a more systematic exploration of different weighting strategies for future work.

3.2 Soft labels
Our second approach to incorporating hierarchical information, soft labels, is a labelembedding approach as described in Sec. 2.1. These methods use a mapping function to associate classes with representations which encode classrelationship information that is absent in the trivial case of the onehot representation. In the interest of simplicity, we choose a mapping function which outputs a categorical distribution over the classes. This enables us to simply use the standard crossentropy loss:
(6)  L_{\mathrm{soft}}(p, C) = -\sum_{A \in \mathcal{C}} y^{\mathrm{soft}}_A(C)\, \log p(A),

where the soft label embedding is given component-wise by

(7)  y^{\mathrm{soft}}_A(C) = \frac{\exp\big(-\beta\, d(A, C)\big)}{\sum_{B \in \mathcal{C}} \exp\big(-\beta\, d(B, C)\big)},

for a class-distance function d, a parameter β, and the set of classes \mathcal{C}. This loss is illustrated in Fig. 2b. For the distance function d, we use the height of the LCA divided by the height of the tree. To understand the role of the hyperparameter β, note that values of β much bigger than the typical inverse distance in the tree result in a label distribution that is nearly one-hot, in which case the cross-entropy reduces to the familiar single-term log-loss expression. Conversely, for very small values of β the label distribution is near-uniform. Between these extremes, greater probability mass is assigned to classes more closely related to the ground truth, with the magnitude of the difference controlled by β.
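The soft labels of Eqn. 7 can be constructed in a few lines. The toy class set and the distance values (LCA height divided by tree height: 0 on the diagonal, 0.5 for siblings, 1.0 across the root) are illustrative assumptions:

```python
import math

classes = ["cat", "dog", "car", "bus"]

# Assumed normalised LCA distances: siblings at 0.5, everything else at 1.0.
sibling_dist = {("cat", "dog"): 0.5, ("car", "bus"): 0.5}

def d(a, b):
    if a == b:
        return 0.0
    return sibling_dist.get((a, b), sibling_dist.get((b, a), 1.0))

def soft_label(true_class, beta):
    """Eqn. 7: exponentially decaying scores, normalised into a PMF."""
    scores = [math.exp(-beta * d(a, true_class)) for a in classes]
    z = sum(scores)
    return [s / z for s in scores]

y = soft_label("cat", beta=5.0)
print([round(v, 3) for v in y])   # most mass on "cat", then its sibling "dog"
# Large beta recovers a (near) one-hot label, small beta a near-uniform one:
print(max(soft_label("cat", beta=100.0)) > 0.999)
```

Training then simply pairs these PMFs with the standard cross-entropy of Eqn. 6, so no architectural change is needed.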
We offer two complementary interpretations that motivate this representation (besides its simplicity). For one, the distribution describing each target class can be considered to be a model of the actual uncertainty that a labeller (e.g. human) would experience due to visual confusion between closely related classes. It could also be thought of as encoding the extent to which a common response to different classes is required of the classifier, i.e. the imposition of correlations between outputs, where higher correlations are expected for more closely related classes. This in turn suggests a connection to the superficially different but conceptually related distillation method of Hinton et al. [hinton2015distilling], in which correlations between a large network's responses to different classes are mimicked by a smaller network to desirable effect. Here, we simply supply these correlations directly, using widely available hierarchies.
Another important connection is to the technique of label smoothing [szegedy2016rethinking], in which the “peaky” distribution represented by a one-hot label is combined with the uniform distribution. This technique has been widely adopted to regularise the training of large neural networks (e.g. [szegedy2016rethinking, chorowski2017towards, vaswani2017attention, zoph2018learning]), but has only very recently [muller2019does] been studied more thoroughly.

4 Evaluation
In the following, we first describe the datasets (Sec. 4.1) and metrics (Sec. 4.2) comprising the setup common to all of our experiments. Then, in Sec. 4.3, we empirically evaluate our two simple proposals and compare them to the prior art. Finally, we experiment with random hierarchies to understand when and how information on class relatedness can help classification.
4.1 Datasets
In our experiments, we use tieredImageNet [ren2018meta] (a large subset of ImageNet/ILSVRC'12 [russakovsky2015imagenet]) and iNaturalist'19 [van2018inaturalist], two datasets with hierarchies that are a) significantly different from one another and b) complex enough to cover a large number of visual concepts. ImageNet aims to populate the WordNet [miller1998wordnet] hierarchy of nouns, with WordNet itself generated by inspecting IS-A lexical relationships. By contrast, iNaturalist'19 [van2018inaturalist] has a biological taxonomy [ruggiero2015higher] at its core.
tieredImageNet was originally introduced by Ren et al. [ren2018meta] for the problem of few-shot classification, in which the sets of classes in the different dataset splits are disjoint. The authors' motivation in creating the dataset was to use the WordNet hierarchy to generate splits containing significantly different classes, facilitating better assessment of few-shot classifiers by enforcing problem difficulty.
Although our task and motivations are different, we chose this dataset because of the large portion of the WordNet hierarchy spanned by its classes. To make it suitable for the problem of (standard) image classification, we resampled the dataset so as to represent all classes across the train, validation, and test splits. Moreover, since the method proposed in Sec. 3.1 and YOLOv2 [redmon2017yolo9000] require that the graph representing the hierarchy is a tree, we modified the graph of the spanned WordNet hierarchy slightly to comply with this assumption (more details are available in Appendix A). After this procedure, we obtained a hierarchy tree spanning 608 classes, whose images we randomly assigned to training, validation, and test splits. We refer to this modified version of tieredImageNet as tieredImageNet-H.
iNaturalist is a dataset of images of organisms that has so far mainly been used to evaluate fine-grained visual categorisation methods. The dataset construction protocol differs significantly from the one used for ImageNet in that it relies on passionate volunteers instead of workers paid per task [van2018inaturalist]. Importantly, for the 2019 edition of the CVPR Fine-Grained Visual Categorization Workshop, metadata encoding hierarchical relationships between species were released. In contrast to WordNet, this taxonomy is an 8-level complete tree that can readily be used in our experiments without modifications. Since the labels of the test set are not public, we randomly resampled three splits (training, validation, and test) covering all 1,010 classes. We refer to this modified version of iNaturalist'19 as iNaturalist19-H.
4.2 Metrics
We consider three measures of performance, covering different interpretations of a classifier’s mistakes.
Top-k error. Under this measure, an example is defined as correctly classified if the ground truth is among the k classes with the highest likelihood. This is the measure normally used to compare classifiers, usually with k = 1 or k = 5. Note that this measure considers all mistakes of the classifier equally, irrespective of how “similar” the predicted class is to the ground truth.
Hierarchical measures. We also consider measures that, in contrast to the top-k error, do weight the severity of mistakes. We use the height of the lowest common ancestor (LCA) between the predicted class and the ground truth as a core severity measure, as originally proposed in the papers describing the creation of ImageNet [deng2009imagenet, deng2010does]. As remarked in [deng2010does], this measure should be thought of in logarithmic terms, as the number of confounded classes is exponential in the height of the ancestor. We also experimented with the Jiang-Conrath distance, as suggested by Deselaers & Ferrari [deselaers2011visual], but did not observe meaningful differences with respect to the height of the LCA.
We consider two measures that utilise the height of the LCA between nodes in the hierarchy.

The hierarchical distance of a mistake is the height of the LCA between the ground truth and the predicted class when the input is misclassified, i.e. when the class with the maximum likelihood is incorrect. Hence, it measures the severity of misclassification when only a single class can be considered as a prediction.

The average hierarchical distance of top-k, instead, takes the mean LCA height between the ground truth and each of the k most likely classes. This measure is important, for example, when multiple hypotheses of a classifier can be considered for a certain downstream task.
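Both measures can be sketched directly from their definitions. The snippet assumes an `lca_height(a, b)` function as described above and predictions given as lists of classes sorted by decreasing likelihood; the trivial two-level hierarchy at the bottom is a toy assumption for the usage example:

```python
def hierarchical_distance_of_mistakes(predictions, ground_truths, lca_height):
    """Mean LCA height over misclassified examples only (top-1 wrong)."""
    dists = [lca_height(pred[0], gt)
             for pred, gt in zip(predictions, ground_truths)
             if pred[0] != gt]
    return sum(dists) / len(dists) if dists else 0.0

def avg_hierarchical_distance_at_k(predictions, ground_truths, lca_height, k):
    """Mean LCA height between the ground truth and each of the top-k
    classes, averaged over all examples (mistaken or not)."""
    total = 0.0
    for pred, gt in zip(predictions, ground_truths):
        total += sum(lca_height(c, gt) for c in pred[:k]) / k
    return total / len(predictions)

# Toy usage: classes "a1", "a2", "b1" with a two-level hierarchy in which
# sharing the leading letter means sharing a parent.
lca = lambda a, b: 0 if a == b else (1 if a[0] == b[0] else 2)
preds = [["a1", "a2", "b1"], ["b1", "a1", "a2"]]
gts = ["a1", "a2"]
print(hierarchical_distance_of_mistakes(preds, gts, lca))       # -> 2.0
print(avg_hierarchical_distance_at_k(preds, gts, lca, k=2))     # -> 1.0
```

Note that the first measure averages over mistakes only, so a classifier can trade a few extra (mild) mistakes for a lower value of it; the @k variant has no such conditioning.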
4.3 Experimental results
In the following, we analyse the performance of the two approaches described in Sec. 3.1 and Sec. 3.2, which we denote HXE and soft labels, respectively. Besides a vanilla cross-entropy-based flat classifier, we also implemented and compared against the methods proposed by Redmon & Farhadi [redmon2017yolo9000] (YOLOv2)³, Frome et al. [frome2013devise] (DeViSE), and Barz & Denzler [barz2019hierarchy]. As mentioned in Sec. 1, these methods represent, to the best of our knowledge, the only modern attempts to deliberately reduce the semantic severity of a classifier's mistakes that are generally applicable to any modern architecture. Note, though, that we do not run DeViSE on iNaturalist19-H, as the class IDs of this dataset are alien to the corpus used by word2vec [mikolov2013efficient].

³Note that this refers to the conditional classifier subsystem proposed in Sec. 4 of that work, not the main object detection system.
Implementation details. Since we are interested in understanding the mechanisms by which the above metrics can be improved, it is essential to use a simple configuration that is common between all of the algorithms taken into account.
We use a ResNet-18 architecture (with weights pretrained on ImageNet) trained with Adam [reddi2019convergence]; the number of training steps, the mini-batch size, and the learning rate are kept fixed across methods unless specified otherwise. To prevent overfitting, we adopt PyTorch's basic data augmentation routines with default hyperparameters: RandomHorizontalFlip() and RandomResizedCrop().
Further implementation details for all of the methods are deferred to Appendix B.
Main results. In Fig. 3 and 4 we show how it is possible to effectively trade off top-1 error against hierarchical error by simply adjusting the hyperparameters α and β of Eqn. 5 and 7. Specifically, increasing α corresponds to (exponentially) discounting information down the hierarchy, thus more severely penalising mistakes where the predicted class is further from the ground truth. Similarly, decreasing β in the soft-label method amounts to progressively shifting label mass away from the ground truth and towards the neighbouring classes. Both methods reduce to the cross-entropy in the respective limits α → 0 and β → ∞. Moreover, notice that varying β affects the entropy of the distribution representing a soft label, with the two limit cases β → ∞ for the standard one-hot case and β → 0 for the uniform distribution. We experiment with several values of α and β.
To limit noise in the evaluation procedure, for both of our methods and all of the competitors, we fit a 4th-degree polynomial to the validation loss (after having discarded the early training steps) and pick the epoch corresponding to its minimum along with its four neighbours. Then, to produce the points reported in our plots, we average the results obtained from these five epochs on the validation set, while reserving the test set for the experiments of Table 1.

Notice how, in Fig. 4, when considering the average hierarchical distance with k = 1, methods are almost perfectly aligned along the plot diagonal, which demonstrates the strong linear correlation between this metric and the top-1 error. This result is consistent with what is observed in [russakovsky2015imagenet], which led the organisers of the ILSVRC workshop to discard rankings based on hierarchical distance after 2012. When considering the other metrics described in Sec. 4.2, a different picture emerges. A trade-off between top-1 error and hierarchical distance is evident in Fig. 3 and in the plots of Fig. 4 with k = 5 and k = 20. Notice how the points belonging to our methods outline a set of trade-offs that subsumes the prior art. For example, in Fig. 3, given any desired trade-off between top-1 error and hierarchical distance of mistakes on tieredImageNet-H, it is better to use HXE than any other method. A similar phenomenon is observable when considering the average hierarchical distance of top-5 and top-20 (Fig. 4), although in these cases it is better to use the soft labels. The only exception to this trend is Barz & Denzler [barz2019hierarchy] on tieredImageNet-H, which can achieve a slightly lower average hierarchical distance for k = 5 or k = 20, at a significant cost in terms of top-1 error.
Table 1: Results on the test sets of tieredImageNet-H (top block) and iNaturalist19-H (bottom block). Columns: hierarchical distance of mistakes; average hierarchical distance @1, @5, and @20; top-1 error. Rows: cross-entropy, Barz & Denzler [barz2019hierarchy], YOLOv2 [redmon2017yolo9000], DeViSE [frome2013devise] (tieredImageNet-H only), and two operating points each for HXE (ours) and soft labels (ours).
Using the results illustrated in Fig. 3 and 4, we pick two reasonable operating points for each of our proposals: one for the high-distance/low-top-1-error regime and one for the low-distance/high-top-1-error regime. We then run both of these configurations on the test sets and report the results in Table 1. The means are again obtained from the five best epochs, and we use the standard deviation to compute 95% confidence intervals.
The trends observed on the validation set largely repeat themselves on the test set. When one wants to prioritise the top-1 error, soft labels with high β or HXE with low α are more appropriate, as they outperform the cross-entropy on the hierarchical-distance-based metrics while being practically equivalent in terms of top-1 error. In cases where the hierarchical measures should be prioritised instead, it is preferable to use soft labels with low β or HXE with high α, depending on the particular choice of hierarchical metric. Although the method of Barz & Denzler is competitive in this regime, it also exhibits the worst deterioration in top-1 error with respect to the cross-entropy.
Our experiments generally indicate, across all tested methods, an inherent tension between performance in the top-1 sense and in the hierarchical sense. We speculate that there may be a connection between this tension and observations arising from the study of adversarial examples, which indicate a trade-off between robustness and (conventional) accuracy, as in e.g. [tsipras2018robustness, zhang2019theoretically].
Can hierarchies be arbitrary? Although the lexical WordNet hierarchy and the biological taxonomy of iNaturalist are not visual hierarchies per se, they arguably reflect meaningful visual relationships between the objects represented in the underlying datasets. Since deep networks leverage visual features, it is interesting to investigate the extent to which the structure of a particular hierarchy is important. In other words, what would happen with an arbitrary hierarchy, one that does not have any relationship with the visual world?
To answer this question, we randomised the nodes of the hierarchies and repeated our experiments. Results on iNaturalist19-H are displayed in Fig. 5 (tieredImageNet-H exhibits a similar trend). Again, we report trade-off plots with top-1 error on the x-axis and metrics based on the height of the LCA (on the randomised hierarchy) on the y-axis. It is evident that the hierarchical distance metrics are significantly worse when using the random hierarchy. Although this is not surprising, the extent of the deterioration is remarkable. This suggests that the structural relationships expressed by a hierarchy are paramount for learning classifiers that, besides achieving competitive top-1 accuracy, are also able to make better mistakes.
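The randomisation itself can be sketched as a permutation of the leaf classes over the leaf positions of the existing tree, preserving the hierarchy's shape while destroying its correspondence with the visual world (a sketch; node names are hypothetical and this is not the authors' code):

```python
import random

def randomise_leaves(parent, seed=0):
    """Given a child -> parent map, randomly permute the leaf
    classes across the leaf positions, keeping the tree's shape."""
    is_parent = set(parent.values())
    leaves = [n for n in parent if n not in is_parent]
    shuffled = leaves[:]
    random.Random(seed).shuffle(shuffled)  # deterministic per seed
    relabel = dict(zip(leaves, shuffled))
    # Relabel leaf keys; internal nodes and all edges are untouched.
    return {relabel.get(c, c): p for c, p in parent.items()}
```

Because only the leaf labels move, the resulting tree has the same depth and branching statistics, so any change in the metrics is attributable to the loss of semantic structure alone.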
Curiously, for the soft labels, the top-1 error of the random hierarchy is consistently lower than that of its “real” counterpart. We speculate this might be due to the structural constraints imposed by a hierarchy anchored to the visual world, which can prevent a neural network from opportunistically learning correlations that achieve low top-1 error at the expense of ever more brittle generalisation. Indeed, the authors of [zhang2016understanding] noted that it is more difficult to train a deep network to map real images to random labels than it is to do so with random images. The most likely explanation is that common visual features, inescapably shared by closely related examples, dictate common responses.
5 Conclusion
Since the advent of deep learning, the computer vision community’s interest in making better classification mistakes seems to have nearly vanished. In this paper, we have shown that this problem is still very much open and ripe for a comeback. We have demonstrated that two simple baselines that modify the cross-entropy loss are able to outperform the few modern methods tackling this problem. Improvements in this task are undoubtedly possible, but it is important to note the delicate balance between standard top-1 accuracy and mistake severity. Our hope is that the results presented in this paper will soon be surpassed by the new competitors they inspire.
References
Appendix A Pruning the WordNet hierarchy
The ImageNet dataset [russakovsky2015imagenet] was generated by populating the WordNet [miller1998wordnet] hierarchy of nouns with images. WordNet is structured as a graph composed of a set of IS-A parent-child relationships. Similarly to the work of Morin & Bengio [morin2005hierarchical] and Redmon & Farhadi [redmon2017yolo9000], our proposed hierarchical cross-entropy loss (HXE, Sec. 3.1) relies on the assumption that the hierarchy underpinning the data takes the form of a tree. We therefore modified the WordNet graph to obtain a tree.
First, for each class, we found the longest path from the corresponding node to the root; this amounts to selecting the paths with the highest discriminative power with respect to the image classes. When multiple such paths existed, we selected the one introducing the minimum number of new nodes and added it to the new hierarchy. Second, we removed the few non-leaf nodes with a single child, as they possess no discriminative power.
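The two steps above can be sketched as follows, on a toy multi-parent graph (node names are hypothetical; this is a sketch, not the authors' implementation):

```python
def longest_paths_to_root(parents, node):
    """All maximum-length paths from `node` to the root, where
    `parents` maps each node to a list of parents (a DAG, as in
    the WordNet noun graph)."""
    if not parents.get(node):
        return [[node]]
    candidates = [[node] + tail
                  for par in parents[node]
                  for tail in longest_paths_to_root(parents, par)]
    best = max(len(p) for p in candidates)
    return [p for p in candidates if len(p) == best]

def splice_single_children(parent, leaves):
    """Remove non-leaf nodes with exactly one child from a
    child -> parent tree, linking the child to its grandparent."""
    parent = dict(parent)
    while True:
        kids = {}
        for c, p in parent.items():
            kids.setdefault(p, []).append(c)
        victims = [n for n, cs in kids.items()
                   if n not in leaves and len(cs) == 1 and n in parent]
        if not victims:
            return parent
        only_child = kids[victims[0]][0]
        parent[only_child] = parent[victims[0]]
        del parent[victims[0]]
```

Tie-breaking among equally long paths (minimising new nodes) would sit on top of `longest_paths_to_root`, comparing each candidate against the nodes already in the partially built tree.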
Finally, we observed that the pruned hierarchy’s root is not physical entity, as one would expect, but rather the more general entity. This is problematic, since entity contains both physical objects and abstract concepts, while tieredImageNetH classes only represent physical objects. Upon inspection, we found that this was caused by the classes bubble, traffic sign, and traffic lights being connected to sphere and sign, which are considered abstract concepts in the WordNet hierarchy. Instead, we connected them to sphere, artifact and signboard, respectively, thus connecting them to physical entity.
Even though our second proposed method (soft labels), the cross-entropy baseline, DeViSE [frome2013devise] and Barz & Denzler [barz2019hierarchy] make no assumptions about the structure of the hierarchy, we ran them with this pruned hierarchy for consistency of the experimental setup.
Appendix B More implementation details
In order to perform meaningful comparisons, we adopted a simple configuration (network architecture, optimiser, data augmentation, …) and used it for all the methods presented in this paper. This configuration is already stated in the implementation details of Sec. 4.3, but we report it again here for convenience.
We used a ResNet-18 architecture (with weights pretrained on ImageNet) trained with Adam [reddi2019convergence] for steps with a minibatch size of . We used a learning rate of unless specified otherwise. To prevent overfitting, we adopted PyTorch’s basic data augmentation routines with default hyperparameters: RandomHorizontalFlip() and RandomResizedCrop(). For both datasets, images were resized to .
Below, we provide further information about the methods we compared against, together with the few minor implementation choices we had to make. As mentioned in Sec. 1, these methods represent, to the best of our knowledge, the only modern attempts to deliberately reduce the semantic severity of a classifier’s mistakes that are generally applicable to any modern architecture.
YOLOv2. In motivating the hierarchical variant of the YOLOv2 framework, Redmon & Farhadi [redmon2017yolo9000, Sec. 4] mention the need to integrate the smaller COCO detection dataset [lin2014microsoft] with the larger ImageNet classification dataset under a unified class hierarchy. Their approach also relies on a heuristic for converting the WordNet graph into a tree, effectively training a conditional classifier at every parent node by using one softmax layer per sibling group and training under the usual softmax loss over leaf posteriors. The authors report only a marginal drop in standard classification accuracy when enforcing this tree-structured prediction, even with the additional internal-node concepts. They note that the approach brings benefits such as graceful degradation on new or unknown object categories: the network can still assign high confidence to a parent class when unsure which of its children is correct.
Since the model outputs conditional probabilities instead of class probabilities, we changed the output dimension of the terminal fully-connected layer, such that it outputs one logit for every node in the hierarchy. Proper normalisation of the conditional probabilities is then enforced at every node of the hierarchy using the softmax function. Finally, the loss is computed by summing the individual cross-entropies of the conditional probabilities on the path connecting the ground-truth label to the root of the tree.
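A minimal sketch of this construction on a toy tree (class names and logit values are placeholders; the real model produces one logit per hierarchy node):

```python
import math

# Toy tree as a child -> parent map; "root" has no entry.
parent = {"animal": "root", "vehicle": "root",
          "cat": "animal", "dog": "animal",
          "car": "vehicle", "bus": "vehicle"}

def siblings(node):
    return [c for c, p in parent.items() if p == parent[node]]

def conditional_probs(logits):
    """One softmax per sibling group: P(node | parent)."""
    probs = {}
    for node in logits:
        group = siblings(node)
        z = max(logits[g] for g in group)  # stabilise the softmax
        denom = sum(math.exp(logits[g] - z) for g in group)
        probs[node] = math.exp(logits[node] - z) / denom
    return probs

def path_to_root(node):
    path = []
    while node in parent:
        path.append(node)
        node = parent[node]
    return path

def tree_loss(logits, label):
    """Sum of the cross-entropies of the conditionals on the
    path from the ground-truth label to the root."""
    probs = conditional_probs(logits)
    return -sum(math.log(probs[n]) for n in path_to_root(label))

def class_prob(logits, leaf):
    """Leaf posterior: product of conditionals along its path."""
    probs = conditional_probs(logits)
    return math.prod(probs[n] for n in path_to_root(leaf))
```

Because the leaf posterior is the product of conditionals along the path, the summed loss above equals the negative log of the leaf posterior, and the leaf posteriors sum to one by construction.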
DeViSE.
Frome et al. [frome2013devise] proposed DeViSE with the aim of both making more semantically reasonable errors and enabling zero-shot prediction. The approach involves modifying a standard deep classification network to instead output vectors representing semantic embeddings of the class labels. The label embeddings are learned through analysis of unannotated text [mikolov2013efficient] in a separate step, and the classification network is modified by replacing the softmax layer with a learned linear mapping to that embedding space. The loss function is a ranking loss that penalises the model when negative label embeddings have greater cosine similarity to the output than the positive one does. Inference comprises finding the nearest class embedding vectors to the output vector, again under cosine similarity.
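The loss and inference rule can be sketched as follows (a sketch, not the paper's exact formulation; the margin value and all names are placeholder assumptions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def rank_loss(output, positive, negatives, margin=0.1):
    """Hinge ranking loss: penalise negative label embeddings whose
    cosine similarity to the output approaches the positive one's."""
    s_pos = cosine(output, positive)
    return sum(max(0.0, margin - s_pos + cosine(output, n))
               for n in negatives)

def predict(output, embeddings):
    """Inference: nearest class embedding under cosine similarity."""
    return max(embeddings, key=lambda c: cosine(output, embeddings[c]))
```

Note the hinge: once the positive similarity exceeds every negative's by the margin, the loss is exactly zero and gradients vanish.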
Since an official implementation of DeViSE is not publicly available, we reimplemented it following the details discussed in the paper [frome2013devise]. Below is the list of changes we found appropriate to make.

For the generation of the word embeddings, instead of the rather dated method of Mikolov et al. [mikolov2013efficient], we used the high-performing and publicly available fastText library [bojanowski2017enriching] (https://github.com/facebookresearch/fastText) to obtain word embeddings of length 300 (the maximum made available by the library).

Instead of a single fully-connected layer mapping the network output to the word embeddings, we used the network “head” described in Listing 1. We empirically verified that this configuration with two fully-connected layers outperforms the one with a single fully-connected layer. Moreover, in this way the number of parameters of DeViSE roughly matches that of the other experiments, whose architectures have a single fully-connected layer but a higher number of outputs (608, the number of classes of tieredImageNet-H, as opposed to 300, the word-embedding size).

Following the procedure described in [frome2013devise], we performed training in two steps. First, we trained only the fully-connected layers for the first steps with a learning rate of . We then trained the entire network for extra epochs, using a learning rate of for the weights of the backbone. Note that [frome2013devise] specifies neither how long the two training steps should last nor the values of the respective learning rates; we chose the above values via a small hyperparameter search.

In [frome2013devise], DeViSE is trained starting from an ImageNet-pretrained architecture. Since we evaluated all methods on tieredImageNet-H, we instead initialised DeViSE’s weights with those of an architecture fully trained with the cross-entropy loss on this dataset. We verified that this obtains better results than starting training from ImageNet weights.
Barz & Denzler [barz2019hierarchy]. This approach first maps class labels into a space in which dot products represent semantic similarity (based on normalised LCA height), then trains a deep network to output matching feature vectors (before the fully-connected layer) for its inputs. It is closely related to DeViSE [frome2013devise], the main difference being that the label embedding is derived from a supplied class hierarchy in a straightforward manner instead of via text analysis: the embedding vectors are iteratively arranged so that all dot products equal the respective semantic similarities. The authors experiment with two loss functions: (1) a linear reward on the dot product between the output feature vector and the ground-truth class embedding (i.e. a penalty on misalignment); and (2) the sum of the preceding and a weighted term of the usual cross-entropy loss on the output of an additional fully-connected layer with softmax activation, for classification. We only used (2), since it attains significantly better results than (1) in [barz2019hierarchy].
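Loss (2) can be sketched as an alignment penalty plus a weighted cross-entropy. This is a sketch under stated assumptions: unit-norm vectors, and a placeholder weighting `lam` (the paper tunes its own value):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    z = max(logits)
    exps = [math.exp(l - z) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def combined_loss(feature, class_embedding, logits, label_idx, lam=0.1):
    """Misalignment penalty (1 minus dot product, assuming unit-norm
    vectors) plus `lam` times the usual cross-entropy on the extra
    classification head."""
    align = 1.0 - sum(f * e for f, e in zip(feature, class_embedding))
    ce = -math.log(softmax(logits)[label_idx])
    return align + lam * ce
```

The alignment term alone corresponds to loss (1); the cross-entropy term is what lets the extra classification head recover competitive top-1 accuracy.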
We used the code released by the authors (https://github.com/cvjena/semanticembeddings) to produce the label embeddings. To ensure consistency with the other experiments, two implementation differences with respect to the original paper were required.

We simply used a ResNet-18 instead of the architectures Barz & Denzler experimented with in their paper [barz2019hierarchy] (i.e. ResNet-110w [he2016deep], PyramidNet-272-200 [han2017deep] and Plain-11 [barz2018deep]).

Instead of SGD with warm restarts [loshchilov2016sgdr], we used Adam [reddi2019convergence] with a learning rate of (the value performing best on the validation set) for steps.
Appendix C Outputting conditional probabilities with HXE
We also investigated whether outputting conditional probabilities instead of class probabilities affects the performance of the classifier in our proposed HXE approach (Sec. 3.1). These two options correspond, respectively, to implementing the hierarchical information as an architectural change or as a modification of the loss only.
Comparing different values of for otherwise identical training parameters, we observe that outputting the class probabilities consistently improves performance across all of our metrics (see Suppl. Fig. D). Moreover, directly considering the class probabilities also has the advantage of not requiring knowledge of the hierarchy at test time.
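The conversion in question multiplies conditionals along each leaf's root-to-leaf path, which is why outputting class probabilities directly avoids needing the hierarchy at test time. A minimal sketch with hypothetical classes:

```python
import math

# Root-to-leaf paths of a toy two-level hierarchy (placeholders).
paths = {"cat": ["animal", "cat"], "dog": ["animal", "dog"],
         "car": ["vehicle", "car"], "bus": ["vehicle", "bus"]}

def class_probs_from_conditionals(cond):
    """Turn per-node conditionals P(node | parent) into leaf class
    probabilities; note this step requires the hierarchy (`paths`)."""
    return {leaf: math.prod(cond[n] for n in path)
            for leaf, path in paths.items()}
```

A network that outputs the leaf probabilities directly skips this step entirely, so the hierarchy is only needed during training.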