1 Introduction
Hierarchical classification addresses the problem of classifying items into a hierarchy of classes. In past years, mainstream classification research did not place enough emphasis on the relations between classes, in our case hierarchical relations. This is gradually changing, and more effort is being put into hierarchical classification in particular, partly because many real-world knowledge systems and services use a hierarchical scheme to organize their data (e.g. Yahoo, Wikipedia). Research in hierarchical classification has become important because flat classification algorithms are ill-equipped to address large-scale problems with hundreds of thousands of hierarchically related classes. Promising initial results on large-scale problems show that hierarchical classifiers can be effective in improving information retrieval (Kosmopoulos et al., 2010).
Many research questions in hierarchical classification remain open. An important issue is how to properly evaluate hierarchical classification algorithms. While standard flat classification problems have benefited from established measures such as precision and recall, there are no established evaluation measures for hierarchical classification tasks, where the assessment of an algorithm becomes more complicated due to the relations among the classes. For example, classification errors in the upper levels of the hierarchy (e.g. wrongly classifying a document of the class music into the class food) are more severe than those in deeper levels (e.g. classifying a document from progressive rock as alternative rock). Several evaluation measures have been proposed for hierarchical classification (HC) (Costa et al., 2007; Sokolova and Guy, 2009), using the hierarchy in different ways. Nevertheless, none of them is widely adopted, making it very difficult to compare the performance of different HC algorithms.
A number of comparative studies of HC performance measures have been published in the literature. An early study can be found in (Sun et al., 2003), which is limited to a particular type of graph-distance measures. A review of HC measures is presented in (Costa et al., 2007), focusing on single-label tasks and without providing any empirical results. In (Nowak et al., 2010) many multi-label evaluation measures are compared, but the role of the hierarchy is not emphasized; in multi-label tasks each object can be assigned to more than one class, e.g. a newspaper article may belong to both politics and economics. Finally, Brucker et al. (2011) provide a comprehensive empirical analysis of HC performance measures, but they focus on the evaluation of clustering methods rather than classification ones. While these studies provide interesting insights, they all miss important aspects of the problem of evaluating HC algorithms. In particular, they do not abstract the problem in order to describe existing evaluation measures within a common framework.
The work presented here addresses these issues by analyzing and abstracting the key components of existing HC performance measures. More specifically:

It groups existing HC evaluation measures under two main types and provides a generic framework for each type, based on flow networks and set theory.

It provides a critical overview of the existing HC performance measures using the proposed framework.

It introduces two new HC evaluation measures that address important deficiencies of state-of-the-art measures.

It provides comparative empirical results on large HC datasets from text classification with a variety of HC algorithms.
The remainder of this paper is organized as follows. Section 2 introduces the problem of HC and presents the general requirements for HC measures together with the proposed frameworks. It then describes existing HC evaluation measures within these frameworks and introduces two new measures that address shortcomings of the state-of-the-art measures. Section 3 presents a case-study comparison and analysis of the proposed and existing measures. Section 4 describes the setting and data of the empirical analysis of the measures, and Section 5 presents and discusses the empirical results. Finally, Section 6 concludes and summarizes remaining open issues.
2 A Framework of Hierarchical Classification Performance Measures
This section presents a new framework within which HC performance measures can be described and characterized. Firstly, supporting notation is defined and then the general requirements for the evaluation are presented and discussed, based on interesting problems that appear in hierarchical classification. We then proceed with the presentation of the proposed framework, which is used in further sections to describe and analyze the measures.
2.1 Notation
In classification tasks the training set is typically denoted as $S = \{(\mathbf{x}_i, Y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i$ is the feature vector of instance $i$ and $Y_i \subseteq C$ is the set of classes to which the instance belongs, where $C = \{c_1, \ldots, c_k\}$.
In contrast to flat classification, where the classes are considered unrelated, in HC the classes are organized in taxonomies. The taxonomies are usually either trees, in which case nodes (classes) have a single parent each, or directed acyclic graphs (DAGs), in which case nodes can have multiple parents; see Figures 1(a) and 1(b) respectively. In some cases, hierarchies may also be cyclic graphs. In all cases the hierarchy imposes a parent-child relation among the classes, which implies that an instance belonging to a specific class also belongs to all its ancestor classes. A taxonomy is thus usually defined as a pair $(C, \prec)$, where $C$ is the set of all classes (Silla and Freitas, 2011) and $\prec$ is the subclass-of relationship. Without loss of generality, we assume a subclass-of relationship among the classes, although in some cases a different relationship may hold, for example part-of. We assume, however, that the following three properties always hold for the relationship:

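Asymmetry: if $c_i \prec c_j$ then $c_j \not\prec c_i$, for every $c_i, c_j \in C$.

Anti-reflexivity: $c_i \not\prec c_i$ for every $c_i \in C$.

Transitivity: if $c_i \prec c_j$ and $c_j \prec c_k$, then $c_i \prec c_k$, for every $c_i, c_j, c_k \in C$.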
In graphs with cycles, only the transitivity property holds. In this article we consider only hierarchies without cycles and we denote the descendants and ancestors of a class as and , respectively. The parents of a class are denoted as . Finally, we assume that an instance can be classified in any class of the hierarchy, and not only in the leaf classes.
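The notation of this section can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `Taxonomy` class, its method names and the small example DAG are hypothetical and do not come from the paper or its figures. The hierarchy is stored as (parent, child) edges, and the ancestors and descendants of a class are obtained as the transitive closure of the parent-child relation.

```python
from collections import defaultdict


class Taxonomy:
    """Acyclic class hierarchy (tree or DAG) built from (parent, child) edges."""

    def __init__(self, edges):
        self.parents = defaultdict(set)   # child  -> direct parents
        self.children = defaultdict(set)  # parent -> direct children
        for parent, child in edges:
            self.parents[child].add(parent)
            self.children[parent].add(child)

    @staticmethod
    def _reachable(start, neighbours):
        # Depth-first traversal; `start` itself is never added because the
        # hierarchy is assumed to be acyclic (anti-reflexivity).
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in neighbours.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def ancestors(self, c):
        return self._reachable(c, self.parents)

    def descendants(self, c):
        return self._reachable(c, self.children)


# Hypothetical DAG (not one of the paper's figures): class "3.2.2" has two parents.
tax = Taxonomy([("root", "3"), ("3", "3.1"), ("3", "3.2"),
                ("3.2", "3.2.1"), ("3.2", "3.2.2"), ("3.1", "3.2.2")])
```

Here `tax.ancestors("3.2.2")` contains both parents and everything above them, which reflects the transitivity property: an instance of a class is also an instance of all ancestor classes.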
2.2 General Problems in Hierarchical Classification Evaluation
The commonly used measures of precision, recall, F-measure, accuracy etc. are not appropriate for HC, due to the relations that exist among the classes. A hierarchical performance measure should use the class hierarchy in order to properly evaluate HC algorithms. In particular, one must account for several different types of error according to the hierarchy. For example, consider the tree hierarchy in Figure 1(a). Assume that the true class for a test instance is 3.1 and that two different classification systems output 3 and 1 as the predicted classes. Using flat evaluation measures, both systems are punished equally, but the error of the second system is more severe as it makes a prediction in a different and unrelated subtree.
In order to measure the severity of an error in hierarchical classification, several issues need to be addressed. Figure 2 presents five cases that require special handling. In all cases, the nodes surrounded by circles are the true classes, while the nodes surrounded by rectangles are the predicted ones. These cases can be grouped into a) pairing problems (Figures 2(d) and 2(e)), where one must select which pairs of predicted and true classes to take into account for the calculation of the error, and b) distance-measuring problems (Figures 2(a), 2(b) and 2(c)), which concern the way the error is calculated for a pair of predicted and true classes.
Figure 2(a) presents an over-specialization error, where the predicted class is a descendant of the true class. Figure 2(b) depicts an under-specialization error, where an ancestor of the true class is selected. In both cases the desired behavior of the measure would be to reduce the penalty to the classification system according to the distance between the true class and the predicted one.
The third case (Figure 2(c)), called alternative paths, presents a scenario where there are two different ways to reach the true class starting from a predicted class. In this case, a measure could use one of the two paths or both in order to evaluate the performance of the classification system. Selecting the path that minimizes the distance between the two classes and using that as a measure of error seems reasonable. In Figure 2(c) the predicted class is an ancestor of the true class, but an alternative paths case may also involve multiple paths from an ancestor to a descendant predicted class.
Figure 2(d) presents a scenario which is common in multi-label data. In this case one must decide, before even measuring the error, which pairs of true and predicted classes should be compared. For example, node A (true class) could be compared to B (predicted) and D to C; or node A could be compared to both B and C, and node D to none; other pairings are also possible. Depending on the pairings, the score assigned to the classifier will be different. It seems reasonable to use the pairings that minimize the classification error. For example, in Figure 2(d) it could be argued that the predictions of B and C are based on evidence about A and thus both B and C should be compared to A.
2.3 Pair-based Measures
Pair-based measures assign costs to pairs of predicted and true classes. For example, in Figure 2(d) class B could be paired with A and class C with D, and then the sum of the corresponding costs would give the total misclassification error.
Let $\hat{Y}$ and $Y$ be the sets of the predicted and true classes, respectively, for a single test instance (the index of the instance is omitted for simplicity). The sets $\hat{Y}$ and $Y$ are augmented with a default predicted class and a default true class, respectively. These classes are used when a predicted class cannot or should not be paired with any true class and vice versa. For example, when the distances between a predicted class and all the true classes exceed a predefined threshold (see the long distance problem in Figure 2(e)), the predicted class may be paired with the default true class.
Additionally, let be the cost of predicting class instead of the true class . The matrix , contains the costs of all possible pairs of predicted and true classes, including the default classes.
Pair-based measures typically calculate the cost of a pair of a predicted class and a true class as the minimum distance between the two classes in the hierarchy, e.g. as the number of edges along the shortest path that connects them. The intuition is that the closer the two classes are in the hierarchy, the more similar they are, and therefore the less severe the error. More elaborate cost measures may assign weights to the hierarchy’s edges, and the weights may decrease when moving from the top to the bottom (Blockeel et al., 2002; Holden and Freitas, 2006). The distance to the default classes is usually set to a fixed large value.
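As a rough illustration of the unweighted case, the following Python sketch counts the edges on the shortest path between two classes using breadth-first search over the hierarchy treated as an undirected graph. The function name `hierarchy_distance` and the small tree are hypothetical; the tree only mimics the spirit of the example in Section 2.2 and is not a reproduction of Figure 1(a).

```python
from collections import deque


def hierarchy_distance(edges, source, target):
    """Number of edges on the shortest undirected path between two classes."""
    adjacency = {}
    for parent, child in edges:
        adjacency.setdefault(parent, set()).add(child)
        adjacency.setdefault(child, set()).add(parent)

    queue, dist = deque([source]), {source: 0}
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None  # the two classes are not connected


# Hypothetical tree: three top-level classes, class "3" has two children.
EDGES = [("root", "1"), ("root", "2"), ("root", "3"), ("3", "3.1"), ("3", "3.2")]

near = hierarchy_distance(EDGES, "3", "3.1")  # prediction is the parent of the true class
far = hierarchy_distance(EDGES, "1", "3.1")   # prediction lies in an unrelated subtree
```

Under these assumptions the prediction in the unrelated subtree is three edges away from the true class, while the parent is one edge away, so a distance-based cost separates the two errors that a flat measure would treat equally. Weighted edges would require a shortest-path algorithm such as Dijkstra's instead of plain breadth-first search.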
In a spirit of fairness (minimum penalty), the aim of an evaluation measure is to pair the classes returned by a system and the true classes in a way that minimizes the overall classification error. This can be formulated as the following optimization problem:
Problem 1.
Constraint (i) states that , which denotes the alignment between classes, is either 0 (classes and are not paired) or 1 (classes and are paired); it furthermore states that the default predicted and true classes cannot be aligned (these default classes are solely used to “collect” those predicted and true classes with no counterpart). The parameters , (constraint (ii)) are the lower and upper bounds of the allowed number of true classes that a predicted class can be paired with. For example, setting requires each predicted class to be paired with exactly one true class. Similarly, the parameters , (constraint (iii)) limit the number of predicted labels that a true class can be paired with. The above constraints directly imply that and , meaning that the default true class can be aligned to at most predicted classes and the default predicted class to at most true classes. (Footnote 2: Indeed, in the worst case, i.e. when all but are 0, constraint (ii) yields ; the reasoning is similar for constraint (iii).)
The above problem corresponds to a best pairing problem in a bipartite graph whose two sets of nodes are the predicted and the true classes, respectively. It is important to note here that the pairing we are looking for is not a matching, since the same node can be paired with several nodes. We opt to approach this problem as a graph pairing problem rather than a linear optimization problem, for two reasons: first, because there exist polynomial solutions to pairing problems in graphs, and second, because the graph framework allows one to easily illustrate how the different cost-based measures proposed so far relate to each other. In particular, we model it as a cost flow minimization problem (Ahuja et al., 1993).
2.3.1 A Flow Network Model for Class Pairing
A flow network is a directed graph $G = (V, E)$ with $m$ edges, where each edge $e \in E$ is associated with a lower and an upper capacity, denoted $l_e$ and $u_e$ respectively. The flow along an edge $e$ is denoted as $f_e$, with $l_e \le f_e \le u_e$. The flow of the network is a vector $\mathbf{f} = (f_1, \ldots, f_m)$. For each vertex $v \in V$, the flow conservation property holds:
$$\sum_{e \in E_{in}(v)} f_e = \sum_{e \in E_{out}(v)} f_e$$
where $E_{in}(v)$ and $E_{out}(v)$ denote the sets of edges entering and leaving vertex $v$ respectively. Each edge $e$ is also associated with a cost $c_e$, which represents the cost of using this edge. The total cost of a flow is:
$$cost(\mathbf{f}) = \sum_{e \in E} c_e f_e = \mathbf{c}^{\top} \mathbf{f}$$
where $\mathbf{c} = (c_1, \ldots, c_m)$. The minimum cost flow is the one that minimizes $cost(\mathbf{f})$ while satisfying the capacity and flow conservation constraints. The quantity to be minimized in flow networks is the same as the one in Problem 1, and the constraints of Problem 1 correspond to capacity constraints, as explained below. Furthermore, the following integrality theorem states that when the bounds of the capacity intervals are integers, there exists a minimal cost flow such that the quantity of flow on each edge is also an integer:
Integrality Theorem.
If a flow network has capacities which are all integer valued and there exists some feasible flow in the network, then there is a minimum cost feasible flow with an integer valued flow on every arc.
Furthermore, all standard algorithms for finding minimal cost flows guarantee to find this particular flow (Ahuja et al., 1993).
Pairing problems in bipartite graphs are represented with flow networks by adding two nodes, a source and a sink, and edges from the source to the first set of nodes, from the second set of nodes to the sink, and from the sink to the source. These extra nodes and edges ensure that the flow conservation constraints are satisfied. For pairbased measures, one thus obtains the following flow network framework (see also Figure 3):

$V$ includes a source, a sink, the predicted classes, the true classes, a default true class and a default predicted class;

$E$ includes edges from the source to all the predicted classes (including the default predicted class), from every predicted class to every true class (including the default true class), from every true class to the sink and from the sink to the source.
No edges exist between the default predicted and default true class, as required by constraint (i) above.
In our setting the capacity interval of an edge expresses the possible number of pairs that each predicted or true class can participate in. The interval between each pair of predicted and true classes restricts the flow on the corresponding edge, which indicates whether this pair will be considered in the calculation of the evaluation measure. Put differently, in the solved flow network the flow values reflect the specific evaluation measure, as they show the pairs that make up the solution with the minimum cost. The intervals between the source and the predicted classes, as well as between the true classes and the sink, also affect the way the pairing is performed.
Due to the constraints in Problem 1, the capacity intervals are defined as follows:

From each predicted class to each true class, excluding the default classes, the capacity interval is [0;1]; the integrality theorem here implies that the flow value between predicted and true classes will be either 0 or 1, i.e. a predicted and a true class will either be paired (1) or not paired (0). The capacity bounds here correspond to the values of Problem 1 (constraint (i));

From the source to a (nondefault) predicted class, the capacity interval is meaning that a predicted class is aligned with at least (and at most ) true classes;

Similarly, from a (nondefault) true class to the sink, the capacity intervals is [;] meaning that a true class is aligned with at least (and at most ) predicted classes;

From each predicted class to the default true class the capacity interval is [0;] and from the default predicted class to each true class the capacity interval is [0;]; from the source (resp. sink) to the default predicted (resp. true) class, the capacity interval is (resp. ), as mentioned in footnote 2;

Lastly, from the sink to the source, the capacity interval is [0;], which corresponds to a loose setting compatible with the intervals given above (this last capacity interval does not impose any constraint but is necessary to ensure flow conservation).
2.3.2 Existing Pair-based Measures
The majority of the existing pair-based measures deal only with tree hierarchies and single-label problems. Under these conditions the pairing problem becomes simple, because a single path exists between the predicted and the true classes. The complexity of the problem increases when the hierarchy is a DAG or when the problem is multi-label; current measures cannot handle the majority of the phenomena presented in Section 2.2.
In the simplest case of pairbased measures (Dekel et al., 2004; Holden and Freitas, 2006), the measure trivially pairs the single prediction with the single true label (), so that . Note that no default classes exist in this measure, or equivalently the corresponding costs are equal to infinity,
For a pair $(\hat{y}_i, y_j)$ of a predicted and a true class (see Figure 4), the cost $c_{ij}$ is taken to be the distance between $\hat{y}_i$ and $y_j$:
$$c_{ij} = \sum_{e \in \pi_{ij}} w_e \qquad (1)$$
where $\pi_{ij}$ is the set of edges along the path from $\hat{y}_i$ to $y_j$ in the hierarchy and $w_e$ is the weight of edge $e$. For $w_e = 1$, we get what Dekel et al. (2004) call tree induced error.
In (Sun and Lim, 2001) two cost measures are proposed for multi-label problems in tree hierarchies, where all possible pairs of the predicted and true classes are used in the calculation. In this case, and . Again, no default classes are used and so the corresponding costs are: . Note that this is an extreme case, where all pairs of predicted and true labels are used. The weights are calculated in two alternative ways: a) as the similarity (e.g., cosine similarity) between the classes of the predicted and true ones, and b) using the distances of the hierarchy as in Equation 1.
A measure dubbed Graph Induced Error (GIE) was proposed and used during the second Large Scale Hierarchical Text Classification challenge (LSHTC; http://lshtc.iit.demokritos.gr/). GIE is based on the best matching pairs of predicted and true classes and can handle multi-label (and single-label) classification with both tree and DAG class hierarchies. For a particular instance being classified, each predicted class is paired either with one true class or with the default true class; multiple predicted classes can be paired with the default true class (Figure 5). Similarly, each true class is paired with exactly one predicted class or with the default predicted class, and several true classes can be paired with the default predicted class. Hence, . The cost is computed as in Equation 1, with . If the hierarchy is a DAG, multiple hierarchy paths may link each predicted class to its paired true class ; then is taken to be the shortest of these paths. The cost of pairing a class (predicted or true) with a default one is set to a positive value . Figure 5 presents the corresponding flow network.
In multi-label classification, GIE’s concept of “best” matching fails to address the pairing problem of Section 2.2. For example, if a predicted class has two true classes as children, as in Figure 2(d), then only one of them would be paired with its parent. The other one would either be penalized with the cost of a default pairing or be paired with another, distant class.
2.3.3 Multi-label Graph Induced Accuracy
We propose here a straightforward extension of GIE called Multilabel Graph Induced Accuracy (MGIA), in which each class is allowed to participate in more than one pair. This extension makes the method more suitable to the pairing problem. Figure 6 presents the MGIA flow network, in which , , . The cost of pairing a class (predicted or true) with a default one is set as in GIE. Solving the flow network optimization problem is easy since the only constraints are that the default predicted class cannot be paired with the default true class and that categories of the same set (predicted or true) cannot be paired to each other. Thus each pairing can be solved separately from the others by pairing a class with either the default class of the other set, or the nearest class of the other set.
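The independent pairing described above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: it assumes that every class must take part in at least one pair, that a pair selected from both sides is counted once, and that ties with the default cost go to the default class. The function name, the `distance` callable, the `default_cost` parameter and the toy distances are all hypothetical, and the sketch makes no claim to reproduce the optimum of the general flow problem.

```python
DEFAULT = None  # placeholder for the default predicted / default true class


def nearest_pairing_cost(predicted, true, distance, default_cost):
    """Pair each class with its nearest class of the other set, or with the
    default class when no real class is cheaper; return (total cost, pairs)."""
    pairs = {}  # (predicted class, true class) -> cost of that pair

    for p in predicted:
        best = min(true, key=lambda t: distance(p, t), default=DEFAULT)
        if best is DEFAULT or distance(p, best) >= default_cost:
            pairs[(p, DEFAULT)] = default_cost
        else:
            pairs[(p, best)] = distance(p, best)

    for t in true:
        best = min(predicted, key=lambda q: distance(q, t), default=DEFAULT)
        if best is DEFAULT or distance(best, t) >= default_cost:
            pairs[(DEFAULT, t)] = default_cost
        else:
            pairs[(best, t)] = distance(best, t)

    # A pair chosen from both sides appears once in the dictionary.
    return sum(pairs.values()), pairs


# Hypothetical distances between predicted classes (B, C) and true classes (A, D).
DIST = {("B", "A"): 1, ("C", "A"): 1, ("B", "D"): 3, ("C", "D"): 2}


def distance(p, t):
    return DIST[(p, t)]


PREDICTED, TRUE, DEFAULT_COST = ["B", "C"], ["A", "D"], 5
total, pairs = nearest_pairing_cost(PREDICTED, TRUE, distance, DEFAULT_COST)
```

In this toy setting both B and C are paired with A, and D is paired with C, so true class A takes part in two pairs, which is the behavior that distinguishes this scheme from a one-to-one matching.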
As in the previous pair-based measures, after the solution of the problem an error is calculated on the solved network. Instead of using this error directly for evaluation, we define an accuracy-based measure as follows:
where is the value provided by the solved flow network.
The above measure is bounded in [0,1] and the better a system is the closer it will be to 1. Note that in the case where all predicted classes and all true classes are paired with the respective default classes, will reach its maximum value and will be equal to the denominator as resulting in a value of 0. Essentially, the advantage of the proposed measure over other pairbased measures is that it takes into account the correct predictions of the classification system (that is the true positives, ).
2.4 Set-based Measures
The performance measures of this category are based on operations on the entire sets of predicted and true classes, possibly including also their ancestors and descendants, as opposed to pair-based measures, which consider only pairs of predicted and true classes.
Set-based measures have two distinct phases:

The augmentation of and with information about the hierarchy.

The calculation of a cost measure based on the augmented sets.
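The augmentation of $Y$ and $\hat{Y}$ is a crucial step, attempting to capture the hierarchical relations of the classes. For example, the sets may be augmented with the ancestors of the true and predicted classes as follows, where $\mathrm{An}(c)$ denotes the set of ancestors of class $c$:
$$Y_{aug} = Y \cup \Big( \bigcup_{y \in Y} \mathrm{An}(y) \Big) \qquad (2)$$
$$\hat{Y}_{aug} = \hat{Y} \cup \Big( \bigcup_{\hat{y} \in \hat{Y}} \mathrm{An}(\hat{y}) \Big) \qquad (3)$$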
Using the augmented sets of predicted and true classes, two approaches have mainly been adopted to calculate the misclassification cost: a) symmetric difference loss and b) hierarchical precision and recall.
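Symmetric difference loss is calculated as follows, where $|S|$ denotes the cardinality of a set $S$ and $Y_{aug}$, $\hat{Y}_{aug}$ are the augmented sets of true and predicted classes:
$$l_{\Delta}(Y_{aug}, \hat{Y}_{aug}) = |\hat{Y}_{aug} \setminus Y_{aug}| + |Y_{aug} \setminus \hat{Y}_{aug}|$$
If we use the initial sets $Y$ and $\hat{Y}$ instead of the augmented ones, the measure becomes the standard symmetric difference for flat multi-label classification. Also, note that the two terms of the symmetric difference loss count the false positives and the false negatives, respectively.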
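On the other hand, hierarchical precision and recall are defined as follows:
$$P_H = \frac{|\hat{Y}_{aug} \cap Y_{aug}|}{|\hat{Y}_{aug}|}, \qquad R_H = \frac{|\hat{Y}_{aug} \cap Y_{aug}|}{|Y_{aug}|}$$
The numerator of these measures counts the true positives and can be written as follows:
$$|\hat{Y}_{aug} \cap Y_{aug}| = |\hat{Y}_{aug} \cup Y_{aug}| - l_{\Delta}(Y_{aug}, \hat{Y}_{aug})$$
where $l_{\Delta}$ denotes the symmetric difference loss, which appears as a subtractive term.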
Set-based measures are not affected by the pairing problem of Figure 2(d) and the long distance problem of Figure 2(e), as they do not rely on pairing of true and predicted classes.
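The two phases of a set-based measure can be sketched in Python as follows. The sketch assumes ancestor-based augmentation, and the function names and the small `PARENTS` map are hypothetical illustrations rather than material from the paper.

```python
def augment_with_ancestors(classes, parents):
    """Phase 1: return the given classes together with all their ancestors."""
    augmented, stack = set(classes), list(classes)
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in augmented:
                augmented.add(parent)
                stack.append(parent)
    return augmented


def symmetric_difference_loss(true_aug, pred_aug):
    """Phase 2a: false positives plus false negatives on the augmented sets."""
    return len(pred_aug - true_aug) + len(true_aug - pred_aug)


def hierarchical_precision_recall(true_aug, pred_aug):
    """Phase 2b: precision and recall on the augmented sets."""
    overlap = len(true_aug & pred_aug)
    precision = overlap / len(pred_aug) if pred_aug else 0.0
    recall = overlap / len(true_aug) if true_aug else 0.0
    return precision, recall


# Hypothetical tree given as child -> list of parents.
PARENTS = {"3": ["root"], "3.1": ["3"], "3.2": ["3"], "3.2.1": ["3.2"]}

true_aug = augment_with_ancestors({"3.2.1"}, PARENTS)  # 3.2.1, 3.2, 3, root
pred_aug = augment_with_ancestors({"3.1"}, PARENTS)    # 3.1, 3, root
loss = symmetric_difference_loss(true_aug, pred_aug)
precision, recall = hierarchical_precision_recall(true_aug, pred_aug)
```

In this toy case the flat sets share no class, but the augmented sets share the two common ancestors, so the prediction receives partial credit. The overlap also equals the size of the union of the augmented sets minus the symmetric difference loss, which can be checked directly on the two sets.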
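2.4.1 Existing Set-based Measures
Existing measures differ mainly in the way the sets of predicted and true classes are augmented. In (Kiritchenko et al., 2005; Struyf et al., 2005; Cai and Hofmann, 2007) the ancestors of the predicted and true classes are added to $\hat{Y}$ and $Y$, as in Equations 2 and 3 above. Alternatively, in Ipeirotis et al. (2001) the descendants of the true and predicted classes are added, with $\mathrm{De}(c)$ denoting the set of descendants of class $c$:
$$Y_{aug} = Y \cup \Big( \bigcup_{y \in Y} \mathrm{De}(y) \Big), \qquad \hat{Y}_{aug} = \hat{Y} \cup \Big( \bigcup_{\hat{y} \in \hat{Y}} \mathrm{De}(\hat{y}) \Big)$$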
In the latter approach, when the true and predicted classes are in different subgraphs of the hierarchy (different subtrees, if the hierarchy is a tree), a maximum penalty will be given, even when several ancestors have been correctly predicted.
In (Cesa-Bianchi et al., 2006), the approach that adds the ancestors is adopted (Equations 2 and 3) but the augmented sets are then altered as follows:
(4) 
(5) 
Equation 5 introduces some tolerance to over-specialization. Consider, for example, Figure 7, where we assume that the only true class is A and the only predicted class is C. According to Equation 3 we add class B (and class A) to the augmented predicted set $\hat{Y}_{aug}$. Based on Equation 5 we then remove C from $\hat{Y}_{aug}$ to avoid penalizing the classification method for both B and C. Similarly, with Equation 4 we tolerate under-specialization. Suppose that in Figure 7 the only true class is C and the only predicted class is A. According to Equation 2, class B (and class A) are added to the augmented true set $Y_{aug}$. Based on Equation 4 we then remove C from $Y_{aug}$ to avoid penalizing the classification method for both B and C. The drawback of this measure is that it tends to favor classification systems that stop their predictions early in the hierarchy.
2.4.2 Lowest Common Ancestor Precision, Recall and $F_1$ Measures
The approach proposed in this paper is based on the hierarchical versions of precision, recall and $F_1$, which add all the ancestors of the predicted and true classes to $\hat{Y}$ and $Y$. Adding all the ancestors has the undesirable effect of over-penalizing errors that happen to nodes with many ancestors.
In an attempt to address this issue, we propose the Lowest Common Ancestor Precision ($P_{LCA}$), Recall ($R_{LCA}$) and $F_1$ ($F_{LCA}$) measures. These measures use the concept of the lowest common ancestor (LCA) as defined in graph theory (Aho et al., 1973).
Definition 1.
The lowest common ancestor LCA$(n_1, n_2)$ of two nodes $n_1$ and $n_2$ of a tree $T$ is defined as the lowest node in $T$ (furthest from the root) that is an ancestor of both $n_1$ and $n_2$.
For example, in Figure 8(a) = . In the case of a DAG the definition of LCA changes. LCA is a set of nodes (instead of a single node), since it is possible for two nodes to have more than one LCA. Furthermore, the LCA may not necessarily be the node that is furthest from the root. In order to define the LCA between two DAG nodes, we use the concept of the shortest path between them.
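For the tree case of Definition 1, the LCA can be found by walking up from both nodes, as in the following Python sketch. It assumes that a node counts as an ancestor of itself for the purposes of the LCA (so that the LCA of a node with itself is that node), which is a convention adopted here for illustration. The function name `tree_lca` and the `PARENT` map are hypothetical and do not reproduce Figure 8(a).

```python
def tree_lca(parent, a, b):
    """Lowest common ancestor of a and b in a tree given as child -> parent.

    Assumption: a node is treated as an ancestor of itself.
    """
    # Collect a and everything above it.
    chain = set()
    node = a
    while node is not None:
        chain.add(node)
        node = parent.get(node)

    # Walk up from b; the first node also above a is the lowest common one.
    node = b
    while node is not None:
        if node in chain:
            return node
        node = parent.get(node)
    return None  # the nodes belong to different trees


# Hypothetical tree.
PARENT = {"2": "root", "3": "root", "2.1": "2", "3.1": "3",
          "3.2": "3", "3.2.1": "3.2", "3.2.2": "3.2"}

siblings = tree_lca(PARENT, "3.2.1", "3.2.2")
cousins = tree_lca(PARENT, "3.1", "3.2.2")
```

This simple walk relies on each node having a single parent. In a DAG a node may have several parents and several LCAs, which is why the definition based on shortest paths is needed there.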
Definition 2.
Given a set containing all paths that connect nodes and , we define as the set for which:
where the cost of a path corresponds to its length, when the edges of the hierarchy are unweighted.
For example in Figure 8(b):
It is worth noting that in the general case is a set of paths; not a single one.
In multi-label classification, we would like to extend the definition of LCA to compare a node (e.g. a true class) against a set of nodes (e.g. the predicted classes).
Definition 3.
The of a node and a set of nodes is the set of all the lowest common ancestors for each , where
For example, in Figure 8(b) and
is .
Given this definition and sets, being the true and the predicted classes of an instance, we compute the of each element of . Similarly for each element of by computing . Using Figure 8(b), let = {2.1, 3.2.1, 3.3} and = {3.1, 3.2.1, 3.2.2}. Then

, connecting 2.1 with either 3.1 using or 3.2.2 using .

, connecting 3.3 with either 3.1 using or 3.2.2 using .

, connecting 3.2.1 with itself.

, connecting 3.2.1 with itself.

, connecting 3.1 with either 2.1 using or 3.3 using .

, the first connecting 3.2.2 with 3.2.1 using
and the second connecting 3.2.2 with either 2.1 using or 3.3 using .
Additionally, we are interested in the sets containing all the LCA of each of the two sets.
Definition 4.
Given a set of true classes (nodes) and a set of predicted classes (nodes) , we define as the set containing all LCA() for all . Similarly we define as the set containing all LCA() for all .
In the above example {3, 3.2.1}, {3, 3.2, 3.2.1}.
Definition 5.
Given a set of true classes (nodes) , a set of predicted classes (nodes) and a set of , we define as the graph that contains:

all :

all subpaths of
Similarly is the graph that contains:

all :

all subpaths of
For example, for the and of figure 8(b) we get the and graphs of figure 9(a) and 9(b), respectively.
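Based on these graphs the true and predicted sets of classes are augmented, in order for the set-based measures to be calculated. In the case of Figure 9, and . The next step is to calculate cost measures based on these two sets, which in our case are the following, where $Y_{aug}$ and $\hat{Y}_{aug}$ denote the augmented sets of true and predicted classes:
$$P_{LCA} = \frac{|\hat{Y}_{aug} \cap Y_{aug}|}{|\hat{Y}_{aug}|}, \qquad R_{LCA} = \frac{|\hat{Y}_{aug} \cap Y_{aug}|}{|Y_{aug}|}, \qquad F_{LCA} = \frac{2 \, P_{LCA} \, R_{LCA}}{P_{LCA} + R_{LCA}}$$
In the example of Figure 9 all three measures, $P_{LCA}$, $R_{LCA}$ and $F_{LCA}$, between the two augmented sets are 0.6. We prefer this approach over the symmetric difference loss, since it takes into account the true positives (TP) in addition to the false positives (FP) and false negatives (FN). Ignoring TP leads systems to prefer predicting fewer categories, since missing a single FP usually costs more than the gain of finding an extra TP. This behavior is also observed in the results of real systems (see Section 4) and is considered undesirable.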
The two graphs and were created using all nodes of and and all corresponding paths. However, subgraphs of the two graphs and , could be selected that would connect each node of with an and vice versa. For example, in Figure 8(b) node 3.2.2 has two LCAs, node 3.2 and 3. Node 3.2 could be removed from and , without breaking the condition of any node in with an LCA or vice versa. We would then get graphs and of Figure 10. , and , between the reduced sets and of Figure 10, are 0.5 instead of 0.6 (Figure 9).
In other words, the two graphs should comprise only the nodes necessary for connecting the nodes of the two sets through their LCAs. Redundant nodes can lead to fluctuations in $P_{LCA}$, $R_{LCA}$ and $F_{LCA}$, and should be removed. In order to obtain the minimal LCA graphs, we have to solve the following maximization problem:
Problem 2.
Minimal LCA graph extension.