In probabilistic multi-class classification, one often encounters situations in which the classifier is uncertain about the class label for a given instance. In such cases, instead of predicting a single class, it might be beneficial to return a set of classes as a prediction, with the idea that the correct class should at least be contained in that set. For example, in medical diagnosis, when not being sure enough about the true disease of a patient, it is better to return a set of candidate diseases. Provided this set is sufficiently small compared to the total number of diagnoses, it can still be of great help for a medical doctor, because only the remaining candidate diseases need further investigation.
Formally, we assume training examples $(x, y)$ from a distribution $P(x, y)$ on $\mathcal{X} \times \mathcal{Y}$, with $\mathcal{X}$ some instance space (e.g., images, documents, etc.) and $\mathcal{Y} = \{c_1, \ldots, c_K\}$ an output space consisting of $K$ classes. In the context of set-valued prediction, we will consider a prediction $\hat{Y}$ from the power set of $\mathcal{Y}$, i.e., predictions are (non-empty) subsets of $\mathcal{Y}$, or more formally, $\hat{Y} \in 2^{\mathcal{Y}} \setminus \{\emptyset\}$. The quality of the prediction $\hat{Y}$ can be expressed by means of a set-based utility score $u(c, \hat{Y})$, where $c$ corresponds to the ground-truth class and $\hat{Y}$ is the predicted set.
Multi-class classifiers that return set-valued predictions have been considered by several authors under different names. In [Coz et al., 2009], the notion of non-deterministic classification is introduced, and the performance of set-valued classifiers is assessed using set-based utility scores from the information retrieval community, such as precision, recall, and the F-measure. Unlike typical top-$k$ prediction in information retrieval, the cardinality of the set varies over instances in set-valued prediction, depending on the uncertainty about the class label for an instance. Other researchers call the same setting credal or cautious classification. In a series of papers, they analyze several set-based utility scores that reward abstention in cases of uncertainty [Corani and Zaffalon, 2008, 2009, Zaffalon et al., 2012, Yang et al., 2017b]. The framework of conformal prediction also produces set-valued predictions, albeit with a focus on confidence (the set covers the true class with high probability) and less on utility [Shafer and Vovk, 2008]. Furthermore, set-valued prediction can be seen as a generalization of multi-class classification with a reject option [Ramaswamy et al., 2015a], where one either predicts a single class or the complete set of classes.
In this paper, we introduce Bayes-optimal algorithms for maximizing set-based utility scores $u(c, \hat{Y})$. To this end, we will consider a decision-theoretic framework, where we estimate a probabilistic model, followed by an inference procedure at prediction time. In a probabilistic multi-class classification framework, we estimate the conditional probability distribution $P(\cdot \mid x)$ over $\mathcal{Y}$:
$$P(c \mid x) = P(y = c \mid x), \quad \forall c \in \mathcal{Y}. \tag{1}$$
This distribution can be estimated using several types of probabilistic methods (see further below). At prediction time, the goal is to find the Bayes-optimal solution by expected utility maximization:
$$\hat{Y}^{*}_{u}(x) = \operatorname*{argmax}_{\hat{Y} \in 2^{\mathcal{Y}} \setminus \{\emptyset\}} U(\hat{Y}, P, u) = \operatorname*{argmax}_{\hat{Y} \in 2^{\mathcal{Y}} \setminus \{\emptyset\}} \sum_{c \in \mathcal{Y}} u(c, \hat{Y}) \, P(c \mid x), \tag{2}$$
where we introduce the shorthand notation $U(\hat{Y}, P, u)$ for the expected utility. The above problem is a non-trivial optimization problem, as a brute-force search requires checking all subsets of $\mathcal{Y}$, resulting in an exponential time complexity. However, we will be able to find the Bayes-optimal prediction more efficiently. As our main contribution, we present several algorithms that solve (2) in an efficient manner. Applying additional search space restrictions, those algorithms will typically exhibit a trade-off between runtime and predictive performance.
In Section 2, we discuss essential properties of the utility scores we consider, and review existing scores covered by our framework. In Section 3, we present two theoretical results as essential building blocks for solving (2), followed by three algorithms in Section 4. We first introduce a simple algorithm (Section 4.1) that makes $K$ queries to the conditional distribution $P(\cdot \mid x)$ (with $K$ the number of classes). In Section 4.2, we introduce an algorithm that makes fewer than $K$ calls to $P(\cdot \mid x)$, provided the conditional class probabilities are modelled with hierarchical models (e.g. hierarchical softmax). In Section 4.3, we show that the Bayes-optimal predictor can be computed even more efficiently when a restriction is made to sets of classes that occur as nodes in a predefined hierarchy. In Section 5, we further discuss the trade-offs between runtime and predictive performance of the different algorithms by means of experiments on benchmark datasets.
2 Utility scores for set-valued prediction
The inference algorithms that we introduce can be applied to a general family of set-based utility functions given as follows:
$$u(c, \hat{Y}) = \begin{cases} 0 & \text{if } c \notin \hat{Y}, \\ g(|\hat{Y}|) & \text{if } c \in \hat{Y}, \end{cases} \tag{3}$$
where $|\hat{Y}|$ denotes the cardinality of the predicted set $\hat{Y}$. This family is parametrized by a sequence $(g(1), \ldots, g(K))$ that should obey the following properties:
$g(1) = 1$, i.e., the utility should be maximal when the classifier returns the true class label as a singleton set;
$g(s)$ should be non-increasing, i.e., the utility should be higher if the true class is contained in a smaller set of predicted classes;
$g(s) \geq 1/s$, i.e., the utility of predicting a set containing the true and $s-1$ additional classes should not be lower than the expected utility of randomly guessing one of these $s$ classes. This requirement formalizes the idea of risk-aversion: in the face of uncertainty, abstaining should be preferred to random guessing (see e.g. [Zaffalon et al., 2012]).
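The three properties above can be checked mechanically for a candidate utility. The following is a minimal sketch, not the authors' code: it writes `g(s)` for the utility of a size-`s` set that contains the true class (the name `g` and all function names are assumptions of this sketch), and it uses the closed forms obtained when the standard definitions of precision, recall, and the F-measure are applied to a single true class.

```python
from fractions import Fraction


def g_precision(s):
    """One relevant class among s predicted ones: precision is 1/s."""
    return Fraction(1, s)


def g_recall(s):
    """The single relevant class is retrieved: recall is 1 for any set size."""
    return Fraction(1)


def g_fmeasure(s):
    """Harmonic mean of precision 1/s and recall 1, i.e. 2/(1+s)."""
    return Fraction(2, 1 + s)


def utility(true_class, predicted_set, g):
    """Set-based utility: g(|set|) if the true class is covered, else 0."""
    if true_class not in predicted_set:
        return Fraction(0)
    return g(len(predicted_set))


def satisfies_properties(g, num_classes):
    """Check g(1) = 1, monotonicity, and g(s) >= 1/s for s = 1..num_classes."""
    values = [g(s) for s in range(1, num_classes + 1)]
    starts_at_one = values[0] == 1
    non_increasing = all(a >= b for a, b in zip(values, values[1:]))
    beats_guessing = all(
        v >= Fraction(1, s) for s, v in enumerate(values, start=1)
    )
    return starts_at_one and non_increasing and beats_guessing
```

Exact fractions are used so that the boundary case of precision, where the third property holds with equality, is not affected by rounding.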
Many existing set-based utility scores are recovered as special cases of (3), including the three classical measures from information retrieval adopted in [Coz et al., 2009]: precision with $g(s) = 1/s$, recall with $g(s) = 1$, and the F-measure with $g(s) = 2/(1+s)$. Other utility functions with specific choices for $g$ are studied in the literature on credal classification [Corani and Zaffalon, 2008, 2009, Zaffalon et al., 2012, Yang et al., 2017b, Nguyen et al., 2018], such as
Especially is commonly used in this community, where and can only take certain values to guarantee that the utility is in the interval . Precision (here called discounted accuracy) corresponds to the case . However, typical choices for ) are and [Nguyen et al., 2018], implementing the idea of risk aversion. The measure is an exponentiated version of precision, where the parameter also defines the degree of risk aversion.
Another example appears in the literature on multi-class classification with reject option [Ramaswamy et al., 2015a]. In this case, the prediction can only be a singleton or the full set containing classes. The first case typically gets a reward of one, while the second case should receive a lower reward, e.g. . This second case corresponds to abstaining, i.e., not predicting any class label, and the (user-defined) parameter specifies a penalty for doing so, with the requirement to be risk-averse. To include sets of any cardinality , the utility could be generalized as follows:
with the same interpretation for , and a second parameter that defines whether is convex or concave. While convexity (like in most of the above utility functions) appears natural in most applications, a concave utility might be useful when predicting a large set is tolerable. In the limit, when , we obtain the simple utility function for classification with reject option. In the supplementary material (Fig. 1), we plot for some of the above parameterizations.
Set-valued predictions are also considered in hierarchical multi-class classification, mostly in the form of internal nodes of the hierarchy [Alex Freitas, 2007, Rangwala and Naik, 2017, Yang et al., 2017a]. Compared to the “flat” multi-class case, the prediction space is thus restricted, because only sets of classes that correspond to nodes of the hierarchy can be returned as a prediction. Some of the above utility scores also appear here. For example, [Yang et al., 2017a] evaluate various members of the family in a framework where hierarchies are considered for computational reasons, while [Oh, 2017] optimizes recall by fixing as a user-defined parameter. Popular in hierarchical multi-label classification is the tree-distance loss, which could also be interpreted as a way of evaluating set-valued predictions (see e.g. [Bi and Kwok, 2015]). This loss is not a member of the family (3), however. Besides, from the perspective of abstention in case of uncertainty, it is a useless loss function (see supplementary material).
3 Theoretical results
In this section, we present two theoretical results as building blocks of the algorithms that we consider later on. The formulation in (2) seems to suggest that all subsets of need to be analyzed to find the Bayes-optimal solution, but a first result shows that this is not the case.
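Theorem 1. The exact solution of (2) can be computed by analyzing only $K$ subsets of $\mathcal{Y}$.
Proof. With $u(c, \hat{Y}) = g(|\hat{Y}|)$ if $c \in \hat{Y}$ and $0$ otherwise, as in (3), and $P(\hat{Y} \mid x) = \sum_{c \in \hat{Y}} P(c \mid x)$, the expected utility can be written as
$$U(\hat{Y}, P, u) = \sum_{c \in \mathcal{Y}} u(c, \hat{Y}) \, P(c \mid x) = \sum_{c \in \hat{Y}} g(|\hat{Y}|) \, P(c \mid x) + \sum_{c \notin \hat{Y}} u(c, \hat{Y}) \, P(c \mid x) = g(|\hat{Y}|) \, P(\hat{Y} \mid x), \tag{5}$$
where the last summation in the second equality cancels out since $u(c, \hat{Y}) = 0$ for $c \notin \hat{Y}$. Let us decompose (2) into an inner and an outer maximization step. The inner maximization step then becomes
$$\hat{Y}^{*k}_{u}(x) = \operatorname*{argmax}_{|\hat{Y}| = k} U(\hat{Y}, P, u) = \operatorname*{argmax}_{|\hat{Y}| = k} P(\hat{Y} \mid x) \tag{6}$$
for $k \in \{1, \ldots, K\}$. This step can be done very efficiently, by sorting the conditional class probabilities, as for a given $k$, only the subset with highest probability needs to be considered. The outer maximization simply consists of computing $\max_{k \in \{1, \ldots, K\}} U(\hat{Y}^{*k}_{u}(x), P, u)$, which only requires the evaluation of $K$ sets. ∎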
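The reduction from all subsets to one candidate per cardinality can be checked numerically on a toy distribution. The sketch below is illustrative only: the function names, the example probabilities, and the choice of the F-measure weighting `2 / (1 + s)` are assumptions made for the example. It compares an exhaustive search over all non-empty subsets with a search over the top-`k` prefixes of the sorted class probabilities.

```python
from itertools import combinations


def expected_utility(subset, probs, g):
    """Expected utility of a set: g(|set|) times its total probability mass."""
    return g(len(subset)) * sum(probs[c] for c in subset)


def best_by_brute_force(probs, g):
    """Maximum expected utility over all non-empty subsets (exponential)."""
    classes = list(probs)
    candidates = (
        combo
        for k in range(1, len(classes) + 1)
        for combo in combinations(classes, k)
    )
    return max(expected_utility(s, probs, g) for s in candidates)


def best_by_top_k(probs, g):
    """Maximum expected utility over the K prefixes of the sorted classes."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return max(
        expected_utility(ranked[:k], probs, g)
        for k in range(1, len(ranked) + 1)
    )


# Toy conditional distribution over four hypothetical classes.
example_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
f_measure_g = lambda s: 2 / (1 + s)
```

Both searches return the same optimum on this example, while the second one evaluates four sets instead of fifteen.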
So, one only needs to evaluate to find the Bayes-optimal solution, which already limits the search to subsets. Can one do even better? It turns out that by restricting , we can assure that the sequence is unimodal. The restriction required for is -convexity, i.e., convexity after a transformation.
A sequence is -convex if
Let be a -convex sequence and let for a given . Then for all .
We refer to the supplementary material for a proof. Actually, -convexity is a rather weak restriction. One can easily prove that every decreasing, concave sequence is -convex (see Theorem 7 in the supplementary material). Many (but not all) convex sequences are also -convex. In particular, precision, recall, the F-measure, as well as the and families for practically useful values of , , and , are all utilities with associated -convex sequences. Precision with in fact defines “how convex” a sequence is allowed to be, because (7) is satisfied as an equality in that boundary case. At the same time, precision is also a boundary case for the properties of discussed in Section 2.
4 Algorithmic results
In this section, we introduce three approaches for solving (2), each consisting of a learning and an inference algorithm. To analyze their complexity, we use a framework similar to learning reductions [Beygelzimer et al., 2016]. We assume that, during prediction, we can query the conditional class distribution for a given . For each approach, we define a class of queries of complexity (e.g., in our first approach, one query consists of computing for a particular class, which for linear models boils down to computing a dot product between model weights and a feature vector). The motivation for this framework is that query operations significantly dominate all other steps of the presented algorithms. For example, although the inference algorithms of all three approaches have an time complexity, they will exhibit clear differences in runtime.
We obtain the distribution by running a learning algorithm on the training data. To analyze the complexity of this learning algorithm, we also define a basic operation of cost per single training example (e.g. one stochastic gradient update for a class). To simplify the analysis, we ignore the accuracy of the distribution compared to the true underlying distribution.
4.1 Unrestricted Bayes-optimal prediction (UBOP)
In the first approach, we assume that a single query to yields for a particular class . To train to answer such queries, we usually use an example from class to also train conditional probabilities for the other classes (e.g., in the 1-vs-all approach). Therefore, the learning complexity with respect to a single training example is . By combining Theorems 1 and 2, we get the inference procedure presented in Algorithm 1. We start by querying the distribution to get conditional class probabilities and sort them in decreasing order. Then, the algorithm computes , …, in a sequential way, till the stopping criterion of Theorem 2 is satisfied. In step , (6) is found by adding the class with -highest conditional class probability to the predicted set, starting from the solution obtained in the previous step. The algorithm is a generalization of an algorithm introduced by [Coz et al., 2009] for optimizing the F-measure in multi-class classification. There is also a strong correspondence with certain F-maximization algorithms in multi-label classification (see e.g. [Jansche, 2007, Ye et al., 2012, Waegeman et al., 2014]). To be compliant with the other algorithms, we state a simple result that obviously follows from sorting the conditional class probabilities.
Theorem 3. Algorithm 1 finds the Bayes-optimal solution with $K$ queries to the distribution $P$.
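The inference procedure of Section 4.1 can be sketched in a few lines. This is a hedged illustration rather than the paper's Algorithm 1 verbatim: the function name `ubop`, the dictionary representation of the class probabilities, and the tie-handling are assumptions of this sketch. The early stop is only valid when the utility sequence satisfies the convexity condition required by Theorem 2; for other sequences the loop would have to run over all prefixes.

```python
def ubop(probs, g):
    """Greedy search over top-k prefixes with early stopping.

    probs: dict mapping each class to its conditional probability.
    g:     function s -> utility of a size-s set containing the true class.
    Returns the predicted set (as a ranked list) and its expected utility.
    """
    # One query per class, followed by a sort in decreasing order.
    ranked = sorted(probs, key=probs.get, reverse=True)

    best_size, best_value, mass = 0, float("-inf"), 0.0
    for size, label in enumerate(ranked, start=1):
        mass += probs[label]           # probability mass of the top-`size` set
        value = g(size) * mass         # expected utility of that set
        if value <= best_value:        # stopping criterion: no improvement
            break
        best_size, best_value = size, value
    return ranked[:best_size], best_value
```

With the probabilities `{"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}`, precision (`1 / s`) yields the singleton `["a"]`, the F-measure (`2 / (1 + s)`) yields `["a", "b"]`, and recall (`1`) yields all four classes.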
4.2 Unrestricted Bayes-Optimal prediction with class hierarchy (UBOP-CH)
Algorithm 1 is simple and elegant, but needs queries to the distribution . As only classes with high probability mass are required, one may think of designing a procedure that queries the top classes one by one, without the need of computing conditional probabilities for all classes. To accomplish this, one could for example adapt specialized data structures used for the approximate nearest neighbors problem [Yagnik et al., 2011, Shrivastava and Li, 2014, Johnson et al., 2017]. Instead, we factorize the distribution in a hierarchical way, leading to a tree structure that allows for efficient querying. Factorizing the distribution has the additional advantage of an improved training time. This approach in fact underlies many popular algorithms, such as nested dichotomies [Fox, 1997, Frank and Kramer, 2004], conditional probability estimation trees [Beygelzimer et al., 2009], probabilistic classifier trees [Dembczyński et al., 2016], or hierarchical softmax [Morin and Bengio, 2005], often used in neural networks as the output layer.
Let us consider a hierarchical structure over the classes, denoted . Let be the set of nodes. In this structure, each node corresponds to a particular non-empty subset of , the root represents the complete set , and the leaves are individual classes, i.e., singleton sets. For a given node , let be a set of its children and its parent. We have that . Let us also define as the set of leaves of the subtree rooted in . The degree of node , denoted by , is the number of children of that node. The depth of node , denoted by , is the number of edges on a path from to the root. By and , we denote, respectively, the degree and the depth of , i.e., and .
We can express the conditional probability of recursively by using the chain rule of probability in the following way:
in which depicts the probability of choosing node in the parent node , with the property that . Note that, for the leaf nodes, we obtain , , i.e., the conditional probabilities of classes.
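The chain-rule factorization can be illustrated with a small tree. The sketch below makes several assumptions that are not in the text: the hierarchy is stored as a dictionary from each node to its list of children, the conditional probability of choosing a child within its parent is stored in a dictionary keyed by the child, and all names and numbers are hypothetical. The probability of a node is the product of the conditional probabilities along the path from the root.

```python
def leaf_probabilities(tree, cond_prob, root):
    """Multiply conditional probabilities along root-to-leaf paths.

    tree:      dict node -> list of child nodes (leaves are absent or map to []).
    cond_prob: dict child -> probability of choosing it inside its parent.
    Returns a dict leaf -> probability of that leaf (i.e. of that class).
    """
    result = {}
    stack = [(root, 1.0)]              # the root covers all classes: mass 1
    while stack:
        node, prob = stack.pop()
        children = tree.get(node, [])
        if not children:               # leaf: prob is the class probability
            result[node] = prob
            continue
        for child in children:
            stack.append((child, prob * cond_prob[child]))
    return result


# Hypothetical two-level hierarchy over three classes.
toy_tree = {
    "root": ["animal", "vehicle"],
    "animal": ["cat", "dog"],
    "vehicle": ["car"],
}
toy_cond = {"animal": 0.7, "vehicle": 0.3, "cat": 0.6, "dog": 0.4, "car": 1.0}
```

Because the conditional probabilities of the children of each node sum to one, the resulting leaf probabilities sum to one as well.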
With such a tree, we assume that queries correspond to probabilities , in contrast to Section 4.1, where queries corresponded to the probabilities . As before, the time complexity of the query is . The learning algorithm uses a training example whenever it is helpful for training , i.e., when the class of the example is in . Since the number of child nodes can be greater than 2, this leads to a multi-class problem. A training example is used in all such problems on the path from the root to the leaf corresponding to . The final cost of training for a single training example is thus . For balanced trees of a constant degree, this gives a logarithmic complexity in the number of classes.
The tree structure combined with Theorem 2 leads to the inference procedure given as Algorithm 2. This -style algorithm is closely related to search methods used with probabilistic tree classifiers [Dembczyński et al., 2012, 2016, Mena et al., 2017]. The algorithm computes , till the stopping criterion in Theorem 2 is reached. By traversing the tree recursively from the root, we only visit the most promising nodes while pruning a large part of the tree. The algorithm uses a priority queue to store all previously visited nodes in descending order of their probabilities , computed via (8). In each step, the node with the highest probability is popped from the list and its child nodes are evaluated. This implies that the leaves are visited in decreasing order of . Each time a leaf is visited, the expected utility is evaluated on the set of classes that correspond to the leaves that are visited so far.
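The priority-queue search described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's Algorithm 2 verbatim: the tree is a dictionary from nodes to child lists, each lookup in `cond_prob` stands in for one query to the hierarchical model, and the function name, the query counter, and the toy numbers are hypothetical. As with the flat variant, stopping at the first non-improving leaf presupposes a utility sequence for which Theorem 2 applies.

```python
import heapq
from itertools import count


def ubop_ch(tree, cond_prob, root, g):
    """Best-first search that visits leaves in decreasing order of probability.

    Returns (predicted classes, expected utility, number of queries made).
    """
    tie = count()                               # tie-breaker for equal priorities
    queue = [(-1.0, next(tie), root)]           # max-heap via negated probability
    visited_leaves, mass = [], 0.0
    best_set, best_value, queries = [], float("-inf"), 0

    while queue:
        neg_prob, _, node = heapq.heappop(queue)
        children = tree.get(node, [])
        if not children:                        # a leaf, i.e. a single class
            visited_leaves.append(node)
            mass += -neg_prob
            value = g(len(visited_leaves)) * mass
            if value <= best_value:             # stopping criterion
                break
            best_set, best_value = list(visited_leaves), value
            continue
        for child in children:                  # expand an internal node
            queries += 1                        # one query per child probability
            heapq.heappush(
                queue, (neg_prob * cond_prob[child], next(tie), child)
            )
    return best_set, best_value, queries


# Hypothetical hierarchy: two internal nodes with two classes each.
toy_tree = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}
toy_cond = {"A": 0.9, "B": 0.1, "a1": 0.8, "a2": 0.2, "b1": 0.5, "b2": 0.5}
```

On this toy hierarchy with precision (`1 / s`), the search returns `["a1"]` after four queries and never expands node `B`, which illustrates how low-probability subtrees are pruned.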
Prior to the main theorem (as a counterpart to Theorem 3), we present the following lemma.
To visit (i.e., pop the node from queue ), with probability , Algorithm 2 requires at most queries to .
The proof (see Appendix D) relies on an analysis of the number of nodes in the subtree of that consists of nodes with probabilities greater than or equal to . The key insight is that the number of leaves in this subtree is upper bounded by .
Algorithm 2 finds the Bayes-optimal solution with at most queries to distribution , where .
The theorem is proven by using Lemma 1 and noticing that the inference procedure stops after visiting a leaf with the highest probability, whose class is not included in (see Appendix D). In the worst case, the number of calls to can be as high as , but it will usually be much smaller than . More precisely, for , the number of queries is less than . For example, this boils down to for complete binary trees.
4.3 Restricted Bayes-optimal prediction with class hierarchy (RBOP-CH)
This approach uses the same distribution as in the previous section. Therefore, the training complexity as well as the complexity of a single query are the same. However, we slightly change the definition of the problem by restricting the set of possible set-valued predictions to the nodes in , i.e., . This further improves the complexity of Algorithm 2. In addition to runtime gains, limiting the search to sets that naturally appear as nodes in the hierarchy also greatly reduces the “semantic complexity” for problems where a correctly specified hierarchy exists. When a domain expert constructs a hierarchy, he/she is often only interested in analyzing those sets, which correspond to groups of classes that belong together, and therefore have a clear interpretation. This is a second motivation for restricting the search to particular subsets.
Footnote 1: A similar inference algorithm could be used with the distribution considered in Section 4.1, by first getting for all and then computing the node probabilities by simple summation. This approach, however, uses queries to the distribution.
Adopting the same reasoning as in (3), the utility maximization problem now becomes
Algorithm 3 solves this restricted optimization problem in an exact manner. It modifies Algorithm 2 with regard to two aspects: (a) After popping a node from , it immediately evaluates the expected utility of the subset that corresponds to that node, and (b) it stops as soon as it encounters a leaf. Due to modification (a), only sets of classes that are associated with the nodes in are considered as candidate predictions. The second modification explains why Algorithm 3 should be faster than Algorithm 2. The following result states the correctness of the algorithm and an upper bound of the number of queries to the distribution , which in many cases is significantly lower than .
To show the correctness, our proof (Section E in the Appendix) makes use of a decomposition into an inner and outer maximization, similar to Theorem 1. The key observation is that, among the nodes with the same number of classes, the algorithm always visits a node with highest probability first. To show the upper bound for the number of queries, we use Lemma 1 and the fact that the algorithm stops after reaching the first leaf. Notice that the ratio of costs between Algorithm 3 and Algorithm 2 can be characterized by the ratio between and from Theorem 4. Moreover, in the worst case, the number of queries of Algorithm 3 is .
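The two modifications that distinguish the restricted search from the unrestricted one can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 3 verbatim: the tree and probability dictionaries, the function names, and the linear utility `1 - 0.1 * (s - 1)` used in the example are assumptions of this sketch. Every popped node is scored as a candidate set, and the search ends at the first leaf, since any node still in the queue has at most the leaf's probability and at most a utility weight of one.

```python
import heapq
from itertools import count


def leaf_count(tree, node):
    """Number of classes covered by a node (recomputed here for brevity)."""
    children = tree.get(node, [])
    if not children:
        return 1
    return sum(leaf_count(tree, child) for child in children)


def rbop_ch(tree, cond_prob, root, g):
    """Best-first search restricted to sets that are nodes of the hierarchy.

    Returns (best node, its expected utility, number of queries made).
    """
    tie = count()
    queue = [(-1.0, next(tie), root)]
    best_node, best_value, queries = None, float("-inf"), 0

    while queue:
        neg_prob, _, node = heapq.heappop(queue)
        # (a) evaluate the popped node immediately as a candidate prediction
        value = g(leaf_count(tree, node)) * -neg_prob
        if value > best_value:
            best_node, best_value = node, value
        children = tree.get(node, [])
        if not children:                        # (b) stop at the first leaf
            break
        for child in children:
            queries += 1
            heapq.heappush(
                queue, (neg_prob * cond_prob[child], next(tie), child)
            )
    return best_node, best_value, queries


toy_tree = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}
toy_cond = {"A": 0.9, "B": 0.1, "a1": 0.8, "a2": 0.2, "b1": 0.5, "b2": 0.5}
```

With the linear utility above, the internal node `A` (expected utility 0.81) beats both the root (0.7) and the leaf `a1` (0.72). In a real implementation the leaf counts would be precomputed once per hierarchy rather than recomputed for every popped node.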
5 Empirical results
We experimentally compare the UBOP, UBOP-CH, and RBOP-CH approaches on seven benchmark datasets with a predefined hierarchy over the classes, because RBOP-CH is only useful when such a hierarchy exists. We analyze four image classification datasets: VOC 2006, VOC 2007 (see Footnote 2), Caltech-101, and Caltech-256 [Everingham et al., 2006, 2007, Li et al., 2003, Griffin et al., 2007], one biological enzyme dataset [Li et al., 2018], and two large-scale text classification datasets LSHTC1 and DMOZ [Partalas et al., 2015]. [...] 2017]) and trained end-to-end on a GPU. For text classification, we use bag-of-words hidden representations assumed to be fixed and given throughout the experiments. For training the probabilistic model, as well as for the implementation of the inference algorithms, we use C++ (Liblinear library [Fan et al., 2008]) to ensure a fair comparison between all classifiers. We estimate the probabilistic model in (8) by means of a hierarchical softmax layer that utilizes the predefined hierarchy. A more in-depth description of the experimental setup can be found in Section G in the supplementary material.
Footnote 2: The multi-label VOC datasets are preprocessed by removing instances with more than one label.
In Table 1, we report results for UBOP, UBOP-CH, and RBOP-CH, optimized on three convex utility scores, listed in decreasing order of convexity: , , and . We report the predictive performance, the runtime and the size of the predicted sets for these utility scores. All scores are averages over the test set. We also perform inference for the mode of as a reference, and report top-1 accuracy in that case. In addition, we show some relevant statistics regarding the benchmark datasets, such as the number of classes and sample sizes.
Unsurprisingly, imposing a hierarchy over the classes leads to a clear gain in training time, due to factorizing the distribution . In addition, clear test time gains are observed when using UBOP-CH, which is in accordance with our theoretical results. Moreover, the efficiency becomes even higher when further restricting the search space. For RBOP-CH, the test time is identical to inference for the mode, with small differences due to separate runs.
However, imposing a hierarchy comes at the expense of top-1 accuracy for most datasets. In case of less convex utility scores, the algorithms typically return bigger prediction sets, resulting in a higher recall for all algorithms. This clearly illustrates the trade-off between precision and correctness that comes with set-based predictions. When comparing the classifiers with a hierarchical probabilistic model, it is clear that restricting the search space during inference (RBOP-CH) results in predictive performance drops. For the image datasets, UBOP seems to yield the best predictive performance. This can be attributed to several factors, such as the relatively low number of classes, informative hidden representations that facilitate the classification tasks, or the use of less meaningful hierarchies. In contrast, UBOP-CH and RBOP-CH outperform UBOP on the datasets with large . In these cases, rather simple hidden representations, together with a high class imbalance, make the classification problem much harder. Here, factorizing the probability distribution by means of a meaningful hierarchy is beneficial. In Section F of the supplementary material, we present additional theoretical results that shed more light on the price one might pay by factorizing the probability distribution.
We introduced a decision-theoretic framework for a general family of set-based utility functions, including most of the measures used in the literature so far, and developed three Bayes-optimal inference algorithms that exploit specific assumptions to improve runtime efficiency. Depending on the concrete dataset, those assumptions may or may not affect predictive performance.
In future work, we plan to extend our decision-theoretic framework toward uncertainty representations more general than standard probability, for example taking up a distinction between so-called aleatoric and epistemic uncertainty recently put forward by several authors [Senge et al., 2014, Kendall and Gal, 2017, Depeweg et al., 2018, Nguyen et al., 2018].
- Alex Freitas  A. d. C. Alex Freitas. A tutorial on hierarchical classification with applications in bioinformatics. In Research and Trends in Data Mining Technologies and Applications, pages 175–208, 2007.
- Beygelzimer et al.  A. Beygelzimer, J. Langford, Y. Lifshits, G. Sorkin, and A. Strehl. Conditional probability tree estimation analysis and algorithms. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 51–58, Arlington, Virginia, United States, 2009. AUAI Press.
- Beygelzimer et al.  A. Beygelzimer, H. D. III, J. Langford, and P. Mineiro. Learning reductions that really work. Proceedings of the IEEE, 104(1):136–147, 2016.
- Bi and Kwok  W. Bi and J. Kwok. Bayes-optimal hierarchical multilabel classification. IEEE Transactions on Knowledge and Data Engineering, 27:1–1, 11 2015.
- Corani and Zaffalon  G. Corani and M. Zaffalon. Learning reliable classifiers from small or incomplete data sets: The naive credal classifier 2. Journal of Machine Learning Research, 9:581–621, 2008.
- Corani and Zaffalon  G. Corani and M. Zaffalon. Lazy naive credal classifier. In Proceedings of the 1st ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data, pages 30–37. ACM, 2009.
- Coz et al.  J. J. D. Coz, J. Díez, and A. Bahamonde. Learning nondeterministic classifiers. The Journal of Machine Learning Research, 10:2273–2293, 2009.
- Dembczyński et al.  K. Dembczyński, W. Waegeman, W. Cheng, and E. Hüllermeier. An analysis of chaining in multi-label classification. In Proceedings of the European Conference on Artificial Intelligence, 2012.
- Dembczyński et al.  K. Dembczyński, W. Kotłowski, W. Waegeman, R. Busa-Fekete, and E. Hüllermeier. Consistency of probabilistic classifier trees. In ECML/PKDD, 2016.
- Depeweg et al.  S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. Decomposition of uncertainty in bayesian deep learning for efficient and risk-sensitive learning. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 1192–1201. PMLR, 2018.
- Everingham et al.  M. Everingham, A. S. M. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results, 2006.
- Everingham et al.  M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007.
- Fan et al.  R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
- Fox  J. Fox. Applied regression analysis, linear models, and related methods. Sage, 1997.
- Frank and Kramer  E. Frank and S. Kramer. Ensembles of nested dichotomies for multi-class problems. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, pages 39–, New York, NY, USA, 2004. ACM.
- Griffin et al.  G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.
- Jansche  M. Jansche. A maximum expected utility framework for binary sequence labeling. In Association for Computational Linguistics, pages 736–743, 2007.
- Johnson et al.  J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017.
- Kendall and Gal  A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5580–5590, 2017.
- Li et al.  F.-F. Li, M. Andreetto, and M. A. Ranzato. Caltech101 image dataset. Technical report, California Institute of Technology, 2003.
- Li et al.  Y. Li, S. Wang, R. Umarov, B. Xie, M. Fan, L. Li, and X. Gao. Deepre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics, 34(5):760–769, 2018.
- Mena et al.  D. Mena, E. Montañés, J. R. Quevedo, and J. J. del Coz. A family of admissible heuristics for A* to perform inference in probabilistic classifier chains. Machine Learning, pages 1–27, 2017.
- Morin and Bengio  F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 246–252. Society for Artificial Intelligence and Statistics, 2005.
- Nguyen et al.  V. Nguyen, S. Destercke, M. Masson, and E. Hüllermeier. Reliable multi-class classification based on pairwise epistemic and aleatoric uncertainty. In IJCAI, pages 5089–5095. ijcai.org, 2018.
- Oh  S. Oh. Top-k hierarchical classification. In AAAI, pages 2450–2456. AAAI Press, 2017.
- Partalas et al.  I. Partalas, A. Kosmopoulos, N. Baskiotis, T. Artières, G. Paliouras, É. Gaussier, I. Androutsopoulos, M. Amini, and P. Gallinari. LSHTC: A benchmark for large-scale text classification. CoRR, abs/1503.08581, 2015.
- Paszke et al.  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
- Ramaswamy et al. [2015a] H. G. Ramaswamy, A. Tewari, and S. Agarwal. Consistent algorithms for multiclass classification with a reject option. CoRR, abs/1505.04137, 2015a.
- Ramaswamy et al. [2015b] H. G. Ramaswamy, A. Tewari, and S. Agarwal. Convex calibrated surrogates for hierarchical classification. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 1852–1860. JMLR.org, 2015b.
- Rangwala and Naik  H. Rangwala and A. Naik. Large scale hierarchical classification: foundations, algorithms and applications. In The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2017.
- Senge et al.  R. Senge, S. Bösner, K. Dembczyński, J. Haasenritter, O. Hirsch, N. Donner-Banzhoff, and E. Hüllermeier. Reliable classification: Learning classifiers that distinguish aleatoric and epistemic uncertainty. Information Sciences, 255:16–29, 2014.
- Shafer and Vovk  G. Shafer and V. Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9:371–421, 2008.
- Shrivastava and Li  A. Shrivastava and P. Li. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages 2321–2329, Cambridge, MA, USA, 2014. MIT Press.
- Simonyan and Zisserman  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Waegeman et al.  W. Waegeman, K. Dembczyński, A. Jachnik, W. Cheng, and E. Hüllermeier. On the bayes-optimality of f-measure maximizers. Journal of Machine Learning Research, pages 3333–3388, 2014.
- Yagnik et al.  J. Yagnik, D. Strelow, D. A. Ross, and R. sung Lin. The power of comparative reasoning. In 2011 International Conference on Computer Vision, pages 2431–2438, Nov 2011.
- Yang et al. [2017a] G. Yang, S. Destercke, and M.-H. Masson. Cautious classification with nested dichotomies and imprecise probabilities. Soft Computing, 21:7447–7462, 2017a.
- Yang et al. [2017b] G. Yang, S. Destercke, and M.-H. Masson. The costs of indeterminacy: How to determine them? IEEE Transactions on Cybernetics, 47:4316–4327, 2017b.
- Ye et al.  N. Ye, K. Chai, W. S. Lee, and H. L. Chieu. Optimizing f-measures: a tale of two approaches. In Proceedings of the International Conference on Machine Learning, 2012.
- Zaffalon et al.  M. Zaffalon, G. Corani, and D. D. Mauá. Evaluating credal classifiers by utility-discounted predictive accuracy. Int. J. Approx. Reasoning, 53:1282–1301, 2012.
Appendix A Interpretation of different utility scores
In this section we provide a tabular overview of the different utility scores discussed in the main paper, with an indication of the corresponding form for . In Figure 1 we also plot for the different utility scores.
Appendix B Additional discussion on the tree distance loss
The tree distance loss is defined as the path distance between the true class and the prediction in the hierarchy. In the main text we claim that the Bayes-optimal predictor for the tree distance loss always returns the smallest subtree with probability at least . This result was proven by [Ramaswamy et al., 2015b], albeit for a slightly different framework, where the ground truth can also be an internal node of the hierarchy. Below we show that the same result also applies to our framework. Consequently, the tree distance loss is not appealing from an abstention perspective, because it does not have a mechanism for controlling the size of a set . In general, this loss will return rather large sets as prediction.
For the theorem and the proof we use some of the notations that are introduced in the main paper. Let us additionally define the Bayes-optimal prediction for the tree-distance loss as the minimizer of the expected risk:
where the expected risk is given by
The solution of optimization problem (10) is the prediction that satisfies the following two properties:
Assume that we search for the node for which is minimal and having . This subtree will be denoted . It can be found by greedy search, using an approach that is very similar to Algorithm 2, starting from the root of the hierarchy. If , then the theorem trivially holds. Therefore, assume . We will prove that this leads to a contradiction. To this end, let us denote the regret of , w.r.t. the tree-distance loss , by:
and let be the set of descendants of node . Now, let us consider three cases:
Case 1: )
and, hence, giving a contradiction.
Let be the child of that is an ancestor of . By construction it holds that , such that