Taxonomy grounded aggregation of classifiers with different label sets

by   Amrita Saha, et al.

We describe the problem of aggregating the label predictions of diverse classifiers using a class taxonomy. Such a taxonomy may not have been available or referenced when the individual classifiers were designed and trained, yet mapping the output labels into the taxonomy is desirable to integrate the effort spent in training the constituent classifiers. A hierarchical taxonomy representing some domain knowledge may be different from, but partially mappable to, the label sets of the individual classifiers. We present a heuristic approach and a principled graphical model to aggregate the label predictions by grounding them into the available taxonomy. Our model aggregates the labels using the taxonomy structure as constraints to find the most likely hierarchically consistent class. We experimentally validate our proposed method on image and text classification tasks.




1 Introduction

In several real-world classification problems (for example visual object recognition [12], text categorization [9, 19], web content classification [5], US Patent codes, ICD [1] codes of diseases [15] etc.) the classes to be predicted are naturally organized into a large pre-defined class hierarchy or a class taxonomy—typically a tree or a DAG (Directed Acyclic Graph). However most state-of-the-art results are obtained with flat classifiers which typically ignore the class hierarchy and treat each class separately. It could be that the taxonomy was not available while training the classifiers or that it was not explicitly used. These hierarchy agnostic flat classifiers can be either multi-class, multi-label, or binary classifiers possibly trained on subsets of the classes. In this paper we address the problem of aggregating the output of multiple such flat classifiers using the pre-defined class taxonomy, since several applications need grounded references to such background taxonomies.

As a motivating example we consider the task of visual object recognition [12, 7]: given an image (or a region in the image), predict the most likely object in it. The computer vision community has made considerable progress on this task, and various pre-trained state-of-the-art flat classifiers are available. Since the number of possible objects is quite large, these classifiers are trained with different datasets and class labels (not necessarily mutually exclusive). For example, the CIFAR-100 dataset [11] has 100 class labels, the PASCAL VOC dataset [6] has 20 class labels, and the ImageNet ILSVRC challenge dataset [21] has 1000 class labels. A considerable amount of research effort and CPU time has been spent on training these classifiers, especially for convolutional neural network based approaches [12, 24], which take weeks to months to train. Hence we are interested in reusing multiple such classifiers on a given image, aggregating their scores and predicting the final label. However, combining such pre-trained flat classifiers poses its own set of challenges, which we list below and address in this paper.

Different class labels: Each classifier is generally trained with the class labels of the labeled dataset that was available for it, so two classifiers may have been trained with different label sets. We use the class taxonomy to ground the different label sets in a common space. For this domain we use WordNet [16], a DAG-structured class taxonomy organized by the hypernym-hyponym hierarchy, which encodes our prior world knowledge about IS-A relations between classes. WordNet is a large lexical database of English in which nouns are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept (around 80k synsets). Synsets are interlinked by means of conceptual-semantic and lexical relations. The most frequently encoded relation among synsets is the super-subordinate relation (also called hyperonymy, hyponymy or IS-A relation), which links more general synsets to increasingly specific ones.

Hierarchically inconsistent predictions: Since the flat classifiers are trained ignoring the taxonomy, the class predictions may not be hierarchically consistent. Consider, for example, a classifier whose label set contains both a class and one of its ancestors. Since the class labels are not mutually exclusive, the classifier may give a high score to the descendant class and a low score to the ancestor, although the taxonomy implies that every instance of the descendant is also an instance of the ancestor, which should therefore receive a high score as well. The same problem persists across different classifiers: if two classifiers predict different but related classes for the same instance, we need to aggregate the two predictions in a hierarchically consistent way.

Different classifier accuracies: The individual classifiers can have different (unknown) accuracies, which have to be accounted for when aggregating them. Moreover, the reported accuracies are based only on the classes each classifier was trained on and ignore the hierarchy. We model the classifier performance using the taxonomy and estimate the accuracies using a validation set.

Depth of the predicted class label—By using the class taxonomy we can actually predict a class label which is not in the label sets used to train the classifiers. We generalize the notion of the label of an instance to a path in the taxonomy. The path starts at the root of the taxonomy and terminates at any class (not necessarily the leaf node) in the taxonomy. We propose strategies to decide where to terminate the path to give a final prediction.

Related work—There is a rich literature in the area of hierarchical classification (see [23] for a survey), which deals with training classifiers by explicitly accounting for the class hierarchy. However in this paper we are primarily concerned with aggregating pre-trained flat classifiers in a hierarchically consistent way. We do not attempt to re-train any classifiers using the taxonomy.

The proposed algorithms can also be used to aggregate (hierarchical) labels collected from multiple annotators via crowdsourcing. While sophisticated techniques exist for binary, categorical and ordinal labels [20] to the best of our knowledge there are no methods for labels defined by a taxonomy.

Another area of research related to our problem setting is that of integrating (or mapping) label-sets into each other as in the case of e-commerce catalog integration [2, 22, 18]. In catalog integration the problem is mapping a source product taxonomy (of a seller) containing textual descriptions of products on to a target master taxonomy (of the ecommerce site). Various techniques for this involve jointly learning class mappings with or without data labeled with both label-sets, and estimating and re-training the master taxonomy using statistics and/or structure of the source taxonomy. The specific formulation we propose is quite different from such a label-set mapping problem since we want to lazily combine only the predictions of differently trained classifiers into a background taxonomy. As in our motivating example of mapping object classification predictions to WordNet, we want to re-use the effort spent in creating and tuning existing classifiers and view them through the lens of WordNet to naturally describe objects in images as per its world knowledge.

In our problem setting the constituent classifiers are free to evolve and change, and have their own area of expertise. Thus ensemble meta-learning methods that learn accuracy estimates or dynamic model selection techniques do not apply to our setting.

Organization—§ 2 introduces the notation and the problem statement. We formally define the notion of class taxonomy and hierarchical consistency in a taxonomy. § 3 presents a heuristic solution based on propagating scores in the class taxonomy and generalizes the notion of the label of an instance to a path in the taxonomy and specifies when to terminate the path. § 4 presents the proposed graphical model. In § 5 we discuss some extensions and present an EM algorithm to estimate the parameters of the graphical model without having access to any validation set. § 6 experimentally validates the various approaches on a visual object recognition and a text classification dataset.

2 Notation and problem statement

Class taxonomies: In this paper we are concerned with classification problems where the class labels are organized hierarchically into class taxonomies. The hierarchy imposes a parent-child IS-A relation among the classes: an instance belonging to a specific class also belongs to all its ancestor classes. Formally, a class taxonomy [23] is defined as a pair $(\mathcal{C}, \prec)$, where $\mathcal{C}$ is a finite set of classes organized hierarchically with the IS-A relationship $\prec$. For any two classes $c_i$ and $c_j$, the relation $c_i \prec c_j$ means that $c_i$ is a sub-class of $c_j$. The relationship satisfies the following properties: (1) Asymmetry: if $c_i \prec c_j$ then $c_j \not\prec c_i$, $\forall c_i, c_j \in \mathcal{C}$. (2) Anti-reflexivity: $c_i \not\prec c_i$, $\forall c_i \in \mathcal{C}$. (3) Transitivity: if $c_i \prec c_j$ and $c_j \prec c_k$ then $c_i \prec c_k$, $\forall c_i, c_j, c_k \in \mathcal{C}$. In graphs with cycles, only the transitivity property holds. In this paper we consider only hierarchies without cycles. For a class $c$ we refer to its parent and child classes, and to its ancestors and descendants (the classes reachable by repeatedly following parent and child links, respectively).

The taxonomy can either be a tree (where each class has a single parent) or a directed acyclic graph (DAG) (where a class can have multiple parents). WordNet is a DAG taxonomy organized by the hypernym-hyponym hierarchy (see Figure 1 for an illustration).

animal
  domestic animal
    dog
      working dog
        watch dog
          pinscher
            doberman
        shepherd dog
          rottweiler
      hunting dog
        hound
          bluetick
  carnivore
    canine
      dog (same node as above)
      fox
    feline
      cat
        domestic cat
        wild cat

Figure 1: A small subset of WordNet illustrating the IS-A DAG taxonomy. In this example dog has two parents, domestic animal and canine.
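The definitions above can be made concrete by storing the taxonomy as a mapping from each class to its parents. The following is a minimal sketch, not code from the paper: the `PARENTS` dictionary re-encodes the Figure 1 subset by hand, and the helper names `ancestors` and `is_subclass` are hypothetical.

```python
# Hand-written encoding of the Figure 1 subset: class -> list of parents.
PARENTS = {
    "animal": [],
    "domestic animal": ["animal"],
    "carnivore": ["animal"],
    "dog": ["domestic animal", "canine"],  # two parents, so this is a DAG
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "fox": ["canine"],
    "cat": ["feline"],
    "domestic cat": ["cat"],
    "wild cat": ["cat"],
    "working dog": ["dog"],
    "hunting dog": ["dog"],
    "watch dog": ["working dog"],
    "shepherd dog": ["working dog"],
    "pinscher": ["watch dog"],
    "doberman": ["pinscher"],
    "rottweiler": ["shepherd dog"],
    "hound": ["hunting dog"],
    "bluetick": ["hound"],
}


def ancestors(cls, parents=PARENTS):
    """Return all strict ancestors of cls (transitive closure of IS-A)."""
    found = set()
    stack = list(parents[cls])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parents[node])
    return found


def is_subclass(a, b, parents=PARENTS):
    """True if a is a strict sub-class of b under the IS-A relation."""
    return b in ancestors(a, parents)


print(sorted(ancestors("fox")))
print(is_subclass("doberman", "carnivore"))
```

Because `ancestors` collects a set, a class that is reachable through several parents (such as animal from dog) is reported only once.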

Flat pre-trained classifiers on taxonomy subsets: We have a set of $K$ trained classifiers $\{f_1, \ldots, f_K\}$. Each classifier $f_k$ is trained using a subset $\mathcal{C}_k$ of the taxonomy classes. We assume that these classifiers are flat multi-class classifiers trained without knowledge of the class hierarchy (a classifier can also be a binary classifier trained for a single class). It is also possible that the same classifier has a category for a parent class (for example dog) and a separate category for a descendant class (for example doberman). Let $s_k^c(x)$ be the score assigned to class $c \in \mathcal{C}_k$ for an instance $x$ by the classifier $f_k$. Without loss of generality we assume that these scores are probabilities, that is, $s_k^c(x) = p(y_c = 1 \mid x)$, where $y_c \in \{0, 1\}$ is the true binary label of the instance for class $c$ in the taxonomy; real-valued scores can be converted to probabilities via the soft-max function or via calibration techniques [25]. We further note that the accuracies of the classifiers differ. In the graphical model presented later (§ 4) we estimate the accuracies via a validation set, and we later present an EM algorithm (§ 5) to estimate them without using a validation set.
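As an illustration of the soft-max conversion mentioned above, the sketch below maps real-valued class scores of one classifier to probabilities. The score values and the function name `softmax` are made up for the example; calibration techniques such as those in [25] would be an alternative.

```python
import math


def softmax(scores):
    """Map real-valued class scores to probabilities that sum to one."""
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {cls: math.exp(s - top) for cls, s in scores.items()}
    total = sum(exps.values())
    return {cls: e / total for cls, e in exps.items()}


raw = {"dog": 2.0, "cat": 0.5, "fox": -1.0}  # made-up real-valued scores
probs = softmax(raw)
print(probs)
```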

We implicitly assume that the classifier label sets are mappable to some classes in the taxonomy. If this is not true, we can approximate the mappings using class mapping techniques [2, 22, 18].

Generalization of the instance label to paths in the taxonomy: Given a taxonomy, the true label of any instance is completely specified by one of the classes at the leaf nodes of the hierarchy. In practice, however, the class label may be specified by any class of the taxonomy above the leaf node. For example, the true label of an instance may be a leaf node of the WordNet taxonomy, while the class label specified by an annotator (or a classifier) is one of its hypernyms.

We generalize the notion of the label of an instance to a label path in the taxonomy. The path starts at the root of the taxonomy and terminates at any class in the taxonomy. Paths that end at a leaf node are completely specified. For any instance with class label $c$ (not necessarily a leaf node) we consider the set of all classes on a path from $c$ to the root node. For a tree taxonomy there is one unique path from the terminal node to the root. For a DAG taxonomy there can be multiple such paths. For example, in the WordNet hierarchy the following two hypernym paths can be found for the class doberman, since dog has two parents, domestic animal and canine:

[doberman, pinscher, watchdog, working dog, dog, domestic animal, animal, organism, living thing, whole, object, physical entity, entity]

[doberman, pinscher, watchdog, working dog, dog, canine, carnivore, placental, mammal, vertebrate, chordate, animal, organism, living thing, whole, object, physical entity, entity]
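To illustrate why a DAG can yield several label paths, the sketch below enumerates every path from a class up to the root. It is only an illustration: the `PARENTS` mapping is a hand-written fragment of the Figure 1 subset, not the full WordNet hierarchy, and `root_paths` is a hypothetical helper name.

```python
# Hand-written fragment of the Figure 1 taxonomy: class -> list of parents.
PARENTS = {
    "animal": [],
    "domestic animal": ["animal"],
    "carnivore": ["animal"],
    "canine": ["carnivore"],
    "dog": ["domestic animal", "canine"],
    "working dog": ["dog"],
    "watch dog": ["working dog"],
    "pinscher": ["watch dog"],
    "doberman": ["pinscher"],
}


def root_paths(cls, parents):
    """All paths from cls up to a root, each listed from cls to the root."""
    if not parents[cls]:
        return [[cls]]
    return [[cls] + rest for p in parents[cls] for rest in root_paths(p, parents)]


for path in root_paths("doberman", PARENTS):
    print(path)
```

In a tree the function returns exactly one path; here it returns two, one through domestic animal and one through canine.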

Taxonomy consistency: Since the classifiers are trained without knowledge of the taxonomy, it is very likely that the scores are not hierarchically consistent. For example, an instance may receive a higher score for a class than for one of its ancestors. However, the taxonomy tells us that every instance of the class is also an instance of the ancestor, hence the ancestor should have at least as high a score. We define the following notion of taxonomy consistency: for an instance with true class label $c$, that is, if $y_c = 1$, taxonomy consistency implies $y_a = 1$ for all ancestors $a$ of $c$. This implies that the probability of $y_a = 1$ is at least the probability of $y_c = 1$ for all ancestors $a$ of $c$.
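A simple way to see whether a set of scores respects this notion is to list every class whose score exceeds that of one of its scored ancestors. The sketch below does this for a made-up three-class chain; the names `ancestors` and `consistency_violations` and the score values are assumptions for illustration.

```python
def ancestors(cls, parents):
    """Return all strict ancestors of cls in a class -> parents mapping."""
    found, stack = set(), list(parents.get(cls, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parents.get(node, []))
    return found


def consistency_violations(scores, parents):
    """Pairs (cls, ancestor) where the ancestor scores lower than its descendant."""
    return sorted(
        (c, a)
        for c in scores
        for a in ancestors(c, parents)
        if a in scores and scores[a] < scores[c]
    )


PARENTS = {"animal": [], "dog": ["animal"], "doberman": ["dog"]}
scores = {"doberman": 0.8, "dog": 0.3, "animal": 0.9}  # made-up scores
print(consistency_violations(scores, PARENTS))  # [('doberman', 'dog')]
```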

Problem statement: Given a set of instances, a taxonomy consisting of a set of classes organized hierarchically with the IS-A relationship, and a set of trained classifiers, where each classifier is trained using a subset of the classes and assigns a score to each of its classes for each instance, the task is to aggregate the scores and estimate the best path(s) in the taxonomy for every instance such that the estimated class labels are taxonomically consistent. There can be multiple paths, and a path need not end at a leaf node: when we do not have enough confidence to make a prediction down to a leaf node, we back off to an ancestor.

3 Score propagation in the class taxonomy

We will first present a heuristic solution by propagating the classifier scores upward from a particular class to all its ancestors in the taxonomy by navigating the IS-A hierarchy upwards. The scores from multiple classifiers at a node are then aggregated by summing them up. The final path is then estimated by traversing the taxonomy from the root and terminating at a class based on the entropy of the children. Specifically (See Figure 2 for an illustration)

  • First construct an induced sub-graph containing the union of all classes from the flat classifiers together with all their ancestors.

  • Initialize the score of each class node in the graph to zero.

  • For each classifier, add the score of each of its classes to the score of that class and to the scores of all its ancestors.

  • Initialize the path to null. Starting with the root node, recursively perform the following steps for every node in the path being constructed.

    • Add the node to the path.

    • Let $S$ be the set of scores of the children of the node.

    • Calculate the entropy of the set $S$ and normalize it based on the number of children.

    • If the normalized entropy exceeds a pre-defined threshold, or the node has no children, quit.

    • Else select the child with the highest score as the next node.

By calculating the entropy at a particular node, we check whether we can go further down the taxonomy for the instance. If the children have nearly equal scores (that is, if the impurity exceeds the threshold value), then we do not have enough evidence to make a decision. Hence we back off and terminate the path at this node.
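The steps above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the toy taxonomy and scores are made up, and the normalization (entropy of the child scores rescaled to sum to one, divided by the logarithm of the number of children) is an assumption, since the text does not pin it down.

```python
import math

# Hypothetical toy taxonomy (class -> list of parents); not taken from the paper.
PARENTS = {
    "animal": [],
    "dog": ["animal"],
    "cat": ["animal"],
    "doberman": ["dog"],
    "rottweiler": ["dog"],
}


def ancestors_or_self(cls, parents):
    """The class plus all its ancestors; a set, so DAG ancestors count once."""
    seen, stack = set(), [cls]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def propagate(classifier_scores, parents):
    """Add every classifier score to its class and to all ancestors of that class."""
    total = {c: 0.0 for c in parents}
    for scores in classifier_scores:
        for cls, s in scores.items():
            for node in ancestors_or_self(cls, parents):
                total[node] += s
    return total


def normalized_entropy(values):
    """Entropy of rescaled child scores divided by log(#children) (assumed form)."""
    mass = sum(values)
    if mass <= 0:
        return 1.0  # no evidence below this node: treat as maximally uncertain
    if len(values) < 2:
        return 0.0  # a single child with evidence leaves nothing to decide
    h = -sum((v / mass) * math.log(v / mass) for v in values if v > 0)
    return h / math.log(len(values))


def predict_path(total, parents, root, threshold):
    """Walk down from the root, stopping when the children are too ambiguous."""
    children = {c: [] for c in parents}
    for child, ps in parents.items():
        for p in ps:
            children[p].append(child)
    path, node = [], root
    while True:
        path.append(node)
        kids = children[node]
        if not kids or normalized_entropy([total[k] for k in kids]) > threshold:
            return path
        node = max(kids, key=lambda k: total[k])


clf_a = {"dog": 0.9, "cat": 0.1}                            # made-up scores
clf_b = {"doberman": 0.5, "rottweiler": 0.45, "cat": 0.05}  # made-up scores
total = propagate([clf_a, clf_b], PARENTS)
path = predict_path(total, PARENTS, "animal", threshold=0.9)
print(total)
print(path)
```

With these numbers the two classifiers agree on dog but the second one is split between doberman and rottweiler, so the path backs off and ends at dog.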

[animal] 2.0
  domestic animal 1.7
    [dog] 1.7
      [working dog] 0.7
        watch dog 0.4
          pinscher 0.4
            doberman 0.4
        shepherd dog 0.3
          rottweiler 0.3
      hunting dog
        hound
          bluetick
  [carnivore] 2.0
    [canine] 1.9
      dog (same node as above)
      fox 0.2
    feline 0.1
      cat 0.1
        domestic cat
        wild cat

Figure 2: Illustration of the heuristic score propagation algorithm (§ 3). Two different classifiers, trained with different class labels, each assign scores to a test instance. The number next to each class is its final score after propagating the scores to the ancestors and summing them up. The final predicted label path for the instance (animal, carnivore, canine, dog, working dog) is marked by square brackets. Note that the path terminates at the non-leaf class working dog.

4 The proposed probabilistic graphical model for aggregating classifiers

The heuristic method assumes that all the classifiers have the same performance and then aggregates the scores. In this section we cast the label aggregation problem as an inference problem in an appropriately defined graphical model (for a given instance). The proposed graphical model (or Bayesian network) has two kinds of nodes: discrete binary nodes corresponding to the hierarchically organized classes in the taxonomy, and continuous nodes corresponding to the classifier scores (see Figure 3, which is the graphical model corresponding to the induced sub-graph in Figure 2).

Each class $c$ in the taxonomy corresponds to a binary discrete node $y_c \in \{0, 1\}$, which is the true (unknown) binary label of the instance for class $c$ in the taxonomy. All the discrete nodes are organized hierarchically with the IS-A relationship defined by the class taxonomy, which defines the conditional independence assumptions of the graphical model. Each binary node is conditioned on its child nodes. Taxonomy consistency is ensured by appropriately setting the conditional probability distribution as follows:

$$p(y_c = 1 \mid y_{c_1}, \ldots, y_{c_m}) = 1 \quad \text{if } y_{c_j} = 1 \text{ for any } j \in \{1, \ldots, m\},$$

where $c_1, \ldots, c_m$ are all the children of $c$. The other entries are estimated using the structure of the taxonomy. This is a reasonable estimate in the absence of any other information and works well in practice. If actual counts from a corpus are available for each class in the taxonomy (for example the Brown corpus for WordNet), we can get more precise estimates.

Figure 3: The proposed graphical model (§ 4) for the two-classifier label aggregation problem illustrated in Figure 2. The solid nodes (continuous) correspond to the classifier predictions and the other nodes (discrete) correspond to the classes in the taxonomy.

For a classifier $f_k$ and a class $c$ (which is in the classifier's class set) we have a continuous node $s_k^c$ which is conditioned on the discrete node $y_c$. We assume that, conditioned on the true label $y_c$, the classifier score $s_k^c$ is normally distributed, that is,

$$s_k^c \mid y_c = 1 \sim \mathcal{N}(\mu_1, \sigma_1^2), \qquad s_k^c \mid y_c = 0 \sim \mathcal{N}(\mu_0, \sigma_0^2),$$

where $\mathcal{N}(\mu, \sigma^2)$ is a normal distribution with mean $\mu$ and variance $\sigma^2$. This is a reasonable model for scores constructed as a linear (or non-linear) combination of many features. For probabilistic classifiers, since the scores lie in the range $[0, 1]$, we first apply a logit or inverse soft-max transformation, which makes the scores approximately normal. The model parameters are estimated by using a validation set with known class labels. We further assume that, conditioned on the true label, the classifiers are independent.
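For a single classifier node, the estimation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the validation scores and labels are made up, the logit is used as the transformation, the prior of 0.5 is arbitrary, and the function names `fit_binormal` and `posterior` are hypothetical.

```python
import math


def logit(p, eps=1e-6):
    """Map a probability to the real line, clipping to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def fit_binormal(scores, labels):
    """Estimate (mean, variance) of the transformed scores for y = 0 and y = 1."""
    params = {}
    for y in (0, 1):
        xs = [logit(s) for s, lab in zip(scores, labels) if lab == y]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        params[y] = (mu, max(var, 1e-6))  # floor keeps the density well defined
    return params


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior(score, params, prior=0.5):
    """p(y = 1 | score) for one classifier node via Bayes' rule."""
    x = logit(score)
    like1 = normal_pdf(x, *params[1]) * prior
    like0 = normal_pdf(x, *params[0]) * (1.0 - prior)
    return like1 / (like1 + like0)


val_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # made-up validation scores
val_labels = [1, 1, 1, 0, 0, 0]              # made-up true binary labels
params = fit_binormal(val_scores, val_labels)
print(params)
print(posterior(0.85, params), posterior(0.15, params))
```

A classifier whose two class-conditional distributions overlap heavily yields posteriors close to the prior, which is how differing classifier accuracies enter the aggregation.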

Having specified the graphical model and the model parameters, the task is to find the most likely configuration of the class nodes given the values of the classifier nodes. We actually compute the marginal probability of each of the class nodes and then traverse the taxonomy using the method described in § 3. The graphical model we have described is a mixed discrete-Gaussian network [3], which consists of both discrete and continuous nodes. The model assumes that the conditional distribution of the continuous variables, given the discrete ones, is multivariate Gaussian. For such networks exact inference algorithms exist [13], which permit the local computation of exact probabilities, means and variances on the junction tree.
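For a very small taxonomy the marginals can be obtained by brute-force enumeration of all label configurations, which makes the structure of the model easy to inspect. The sketch below does this instead of running the junction tree algorithm. Everything numeric in it is an assumption for illustration: the three-class taxonomy, the leaf prior, the probability `Q` of a parent being true when all its children are false, and the shared Gaussian parameters.

```python
import itertools
import math

# Hypothetical three-class taxonomy: class -> list of children.
CHILDREN = {"dog": ["doberman", "rottweiler"], "doberman": [], "rottweiler": []}
LEAF_PRIOR = 0.3  # assumed p(y = 1) for classes without children
Q = 0.2           # assumed p(parent = 1 | all children = 0)
# Assumed (mean, variance) of transformed scores given y, shared by all nodes.
PARAMS = {1: (2.0, 1.0), 0: (-2.0, 1.0)}


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def joint_weight(y, observations):
    """Joint density of one label configuration y and the observed scores."""
    w = 1.0
    for cls, kids in CHILDREN.items():
        if not kids:
            p1 = LEAF_PRIOR
        elif any(y[k] for k in kids):
            p1 = 1.0  # taxonomy consistency: a true child forces a true parent
        else:
            p1 = Q
        w *= p1 if y[cls] else 1.0 - p1
    for cls, score in observations:
        w *= normal_pdf(score, *PARAMS[y[cls]])
    return w


def marginals(observations):
    """p(y_c = 1 | scores) for every class, by summing over all configurations."""
    classes = list(CHILDREN)
    mass = {c: 0.0 for c in classes}
    z = 0.0
    for bits in itertools.product((0, 1), repeat=len(classes)):
        y = dict(zip(classes, bits))
        w = joint_weight(y, observations)
        z += w
        for c in classes:
            if y[c]:
                mass[c] += w
    return {c: mass[c] / z for c in classes}


# One classifier scores doberman high, another scores dog slightly low (logit scale).
post = marginals([("doberman", 1.5), ("dog", -0.5)])
print(post)
```

Enumeration costs time exponential in the number of classes, so it is only useful as a check on small examples; because inconsistent configurations receive zero weight, the marginal of a parent is never smaller than that of any of its children.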

5 Extensions

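Discrete labels: We can also incorporate classifiers which produce discrete labels. Instead of the bi-normal distribution, two parameters define the conditional probability distribution at each classifier node: the probability of the classifier emitting the label when the true binary label is 1, and the corresponding probability when the true binary label is 0.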

The EM algorithm for estimating classifier parameters: In § 4 we estimated the parameters by using a separate validation set. Sometimes we may not have access to a labeled validation set. This is especially true when we are interested in aggregating crowdsourced labels, where the goal is to estimate the true labels. In such scenarios we can estimate the model parameters directly via the Expectation Maximization (EM) algorithm. The EM algorithm [4] is an efficient iterative procedure to estimate the parameters in the presence of missing or hidden data. We use the unknown true labels as the missing data in our case. Each iteration of the algorithm consists of two steps: an Expectation (E) step and a Maximization (M) step. The M-step involves maximization of a lower bound on the log-likelihood that is refined by the E-step. Specifically, in the E-step we obtain the marginal probabilities of all the class nodes given the current estimate of the model parameters and the observed classifier scores. In the M-step we re-estimate the model parameters given the class labels (marginal probabilities) for the taxonomy class nodes. The only difference from the supervised case is that, when estimating the parameters of the bi-normal model, we use these marginal probabilities instead of the binary labels (from the validation set). These two steps are iterated until convergence.
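The E- and M-steps can be illustrated on a deliberately simplified case: a single class node observed by several classifiers, rather than the full taxonomy. In the sketch below the data, the initialization from the average score, the variance floor and the function name `em_binormal` are all assumptions made for the example.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_binormal(X, n_iter=50):
    """X[i][k]: transformed score of classifier k on instance i for one class."""
    n, K = len(X), len(X[0])
    # Initialise soft labels from the average score (an arbitrary choice).
    avg = [sum(row) / K for row in X]
    mid = sum(avg) / n
    post = [0.9 if a > mid else 0.1 for a in avg]
    for _ in range(n_iter):
        # M-step: weighted prior, means and variances using the soft labels.
        prior = sum(post) / n
        params = []
        for k in range(K):
            pk = {}
            for y, w in ((1, post), (0, [1.0 - p for p in post])):
                sw = sum(w)
                mu = sum(wi * row[k] for wi, row in zip(w, X)) / sw
                var = sum(wi * (row[k] - mu) ** 2 for wi, row in zip(w, X)) / sw
                pk[y] = (mu, max(var, 1e-3))  # floor avoids degenerate variances
            params.append(pk)
        # E-step: posterior of y = 1, with classifiers independent given y.
        post = []
        for row in X:
            l1, l0 = prior, 1.0 - prior
            for k in range(K):
                l1 *= normal_pdf(row[k], *params[k][1])
                l0 *= normal_pdf(row[k], *params[k][0])
            post.append(l1 / (l1 + l0))
    return post, params, prior


# Made-up transformed scores from two classifiers on six unlabeled instances.
X = [[2.1, 1.8], [1.9, 2.4], [2.3, 1.5], [-2.0, -1.7], [-1.8, -2.2], [-2.4, -1.9]]
post, params, prior = em_binormal(X)
print([round(p, 3) for p in post])
```

The weighted means and variances in the M-step are exactly the validation-set estimates with the binary labels replaced by the current marginal probabilities.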

Entry level categories: The proposed algorithm returns a path in the taxonomy. The final label is computed by terminating the path based on the entropy. However, for some domains we may want to terminate based on other domain-specific criteria. For example, in an object recognition task we may want to assign the label of the object to an 'entry level' category, that is, the label people would use to name the object. We can modify our termination strategy to account for this by suitably backing off the path using ideas from [17].

6 Experimental Validation

We experimentally validate our proposed algorithms on two domains, object recognition in images and text categorization. In both these domains we have a natural pre-defined taxonomy.

Datasets: For visual object recognition we use a subset of images from the ImageNet ILSVRC2014 detection challenge dataset [21]. The dataset consists of 200 basic categories that map to a total of 547 WordNet synsets (including the children). Out of this 547-node taxonomy we extract a sub-hierarchy rooted at a single basic category, resulting in a taxonomy of 297 nodes. The sets of images under these 297 categories are completely non-overlapping, and the aggregated set of 70096 images forms the final dataset for our experiments. For our text categorization experiments, we used the benchmark Reuters Corpus Volume 1 (RCV1) news articles dataset [14]. This is a hierarchical multi-labeled dataset in which over 138 thousand articles are tagged along the topics, industries and regions facets. We used the 354-class industries taxonomy in our experiments.

Experimental setup: We split the data into training, validation and test sets. We randomly select a subset of classes from the taxonomy and train a multi-class flat classifier, completely ignoring the class taxonomy. For object detection, we take the activations of the sixth hidden layer of a deep convolutional neural network as features and then train a linear multi-class SVM on these features [8]. For text categorization we used the standard token pre-processing available for this dataset with a tf-idf representation and trained a linear SVM. Each classifier is trained using the training split, with the validation split used to tune any hyper-parameters. We experimented with 10 flat classifiers, each operating on different (partially overlapping) class labels (see Table 1 for the number of classes). For each of the flat classifiers, if a training sample belongs to multiple labels (where one label is a parent of another), then during training-set construction we randomly assign the instance to one of the classes in order to avoid confounding the classifier. The final model performance is evaluated on the test split, which contains instances of all the classes in the taxonomy.

Evaluation metrics: Accuracy is not a good metric for evaluating the performance of the various classifiers because of the hierarchical relations among classes. The classifiers operate only on a subset of the labels in the taxonomy, but the test set can contain instances from all the classes in the taxonomy. There is a rich literature on evaluation measures for hierarchical classification (see [10] for a review). For our evaluation we choose the Lowest Common Ancestor (LCA) based precision, recall and F-score measures, as recommended in [10]. These measures are essentially hierarchical versions of precision, recall and F-score based on the LCA of the actual and the predicted class.
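To give a feel for how LCA-based measures behave, the sketch below implements one simplified reading of them for a tree taxonomy with a single true and a single predicted class: each class is augmented with the classes on its path up to the lowest common ancestor, and precision and recall are computed on the augmented sets. The general definitions in [10] also cover DAGs and multiple labels, which this sketch does not attempt; the taxonomy and the function names are hypothetical.

```python
def path_to_root(cls, parent):
    """Classes from cls up to the root of a tree given as class -> parent."""
    path = [cls]
    while parent[cls] is not None:
        cls = parent[cls]
        path.append(cls)
    return path


def lca_prf(true_cls, pred_cls, parent):
    """Simplified LCA-based precision, recall and F-score (tree, single label)."""
    t_path = path_to_root(true_cls, parent)
    p_path = path_to_root(pred_cls, parent)
    on_pred_path = set(p_path)
    lca = next(c for c in t_path if c in on_pred_path)
    # Augment each class with the nodes on its path up to (and including) the LCA.
    t_aug = set(t_path[: t_path.index(lca) + 1])
    p_aug = set(p_path[: p_path.index(lca) + 1])
    overlap = len(t_aug & p_aug)
    precision = overlap / len(p_aug)
    recall = overlap / len(t_aug)
    f_score = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical tree taxonomy: class -> parent (None marks the root).
PARENT = {"animal": None, "dog": "animal", "cat": "animal",
          "doberman": "dog", "rottweiler": "dog"}

print(lca_prf("doberman", "rottweiler", PARENT))  # sibling confusion
print(lca_prf("doberman", "dog", PARENT))         # backing off to the parent
```

Under this reading, predicting an ancestor of the true class keeps precision at one and only lowers recall, which matches the intent of backing off when the evidence is weak.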

Results: Table 1 shows the mean (and the standard deviation) of the hierarchical precision, recall and F-score on the test set for each of the individual classifiers and for our proposed heuristic (§ 3) and graphical model (§ 4) based aggregation algorithms, for both the visual object recognition and the text classification tasks. For the graphical model inference we used the junction tree algorithm from the Bayes Net Toolbox. The appropriate thresholds for deciding the terminal class were tuned using the validation set. For the heuristic algorithm the path termination was based on the entropy. For the graphical model the terminal node was decided based on the marginal probabilities. For the visual object recognition dataset, on all three measures the aggregation algorithms perform at least as well as the best performing classifier in the ensemble, with the graphical model based approach outperforming the heuristic approach. For the object recognition dataset we also show results for an alternate termination strategy, described in § 5, which decides the terminal node by backing off to a suitable entry level class [17]. This alternate termination strategy gave further improvements. For the text classification dataset the graphical model outperforms the best classifier in terms of precision by a large margin.

visual object recognition: 18246 instances

Model         classes   precision     recall        F-score
classifier0   16        0.35 [0.27]   0.37 [0.27]   0.35 [0.26]
classifier1   10        0.33 [0.21]   0.35 [0.22]   0.33 [0.21]
classifier2   18        0.36 [0.28]   0.27 [0.22]   0.30 [0.21]
classifier3   14        0.32 [0.20]   0.33 [0.16]   0.31 [0.14]
classifier4   15        0.32 [0.14]   0.31 [0.16]   0.31 [0.14]
classifier5   10        0.31 [0.14]   0.31 [0.16]   0.30 [0.13]
classifier6   16        0.24 [0.11]   0.31 [0.14]   0.26 [0.11]
classifier7   11        0.21 [0.14]   0.20 [0.13]   0.20 [0.13]
classifier8   10        0.23 [0.13]   0.31 [0.16]   0.26 [0.13]
classifier9   14        0.24 [0.12]   0.33 [0.15]   0.27 [0.12]
heuristic               0.40 [0.19]   0.38 [0.15]   0.38 [0.13]
graphical               0.44 [0.21]   0.39 [0.13]   0.41 [0.13]
entry-level             0.57 [0.29]   0.37 [0.12]   0.44 [0.13]
entry-level             0.58 [0.29]   0.38 [0.12]   0.46 [0.13]

text classification: 41395 instances

Model         classes   precision     recall        F-score
classifier0   10        0.29 [0.12]   0.32 [0.15]   0.30 [0.13]
classifier1   10        0.36 [0.19]   0.35 [0.20]   0.35 [0.19]
classifier2   10        0.49 [0.27]   0.54 [0.33]   0.51 [0.28]
classifier3   10        0.52 [0.36]   0.50 [0.34]   0.52 [0.34]
classifier4   10        0.50 [0.28]   0.44 [0.31]   0.47 [0.29]
classifier5   10        0.41 [0.25]   0.40 [0.26]   0.40 [0.25]
classifier6   10        0.46 [0.32]   0.49 [0.31]   0.47 [0.30]
classifier7   10        0.54 [0.34]   0.55 [0.33]   0.54 [0.33]
classifier8   10        0.55 [0.33]   0.54 [0.34]   0.53 [0.32]
classifier9   10        0.34 [0.15]   0.32 [0.15]   0.33 [0.15]
heuristic               0.49 [0.27]   0.46 [0.30]   0.47 [0.28]
graphical               0.77 [0.27]   0.45 [0.28]   0.56 [0.24]

Table 1: Experimental results (§ 6). The mean (and the standard deviation, in brackets) of the hierarchical precision, recall and F-score measures on the test set for each of the individual classifiers and for our proposed heuristic (§ 3) and graphical model (§ 4) based aggregation algorithms, for both the visual object recognition and the text classification tasks. For each task and measure the best performing classifier is underlined, and a proposed algorithm is in bold if its performance is better than that of the best individual classifier.

7 Conclusion

In this paper we formulated the problem of aggregating labels from multiple flat classifiers into a possibly different hierarchical taxonomy for the labels. This was achieved without modifying or retraining the constituent classifiers in any way. We proposed two solutions, one based on heuristic score propagation through the taxonomy and a more principled approach using a graphical model. The proposed algorithms were experimentally validated on two real-world problems, visual object recognition and text categorization. We plan to extend this approach to taxonomies with cycles and to curated knowledge graphs or ontologies. Our model implicitly assumes that the classifier label sets are mappable to some classes in the taxonomy. We plan to integrate ideas from catalog integration [2] directly into our graphical model.


  • [1]
  • [2] Rakesh Agrawal and Ramakrishnan Srikant. On integrating catalogs. In Proceedings of the 10th international conference on World Wide Web, pages 603–612. ACM, 2001.
  • [3] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic networks and expert systems: Exact computational methods for Bayesian networks. Springer, 2006.
  • [4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1–38, 1977.
  • [5] S. Dumais and H. Chen. Hierarchical classification of web content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 256–263, 2000.
  • [6] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • [7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.
  • [8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. CoRR, abs/1310.1531, 2013.
  • [9] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 170–178, 1997.
  • [10] A. Kosmopoulos, I. Partalas, E. Gaussier, G. Paliouras, and I. Androutsopoulos. Evaluation measures for hierarchical classification: a unified view and novel approaches. Data Mining and Knowledge Discovery, 29(3):820–865, 2015.
  • [11] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
  • [12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet Classification with Deep Convolutional Neural Networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [13] S. L. Lauritzen. Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108, 1992.
  • [14] D. D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361–397, 2004.
  • [15] L. V. Lita, S. Yu, R. S. Niculescu, and J. Bi. Large scale diagnostic code classification for medical patient records. In IJCNLP, pages 877–882, 2008.
  • [16] G. A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 38(11):39–41, 1995.
  • [17] V. Ordonez, J. Deng, Y. Choi, A. C. Berg, and T. L. Berg. From large scale image categorization to entry-level categories. In IEEE International Conference on Computer Vision (ICCV), 2013.
  • [18] Panagiotis Papadimitriou, Panayiotis Tsaparas, Ariel Fuxman, and Lise Getoor. Taci: Taxonomy-aware catalog integration. Knowledge and Data Engineering, IEEE Transactions on, 25(7):1643–1655, 2013.
  • [19] I. Partalas, A. Kosmopoulos, N. Baskiotis, T. Artieres, G. Paliouras, E. Gaussier, I. Androutsopoulos, M.-R. Amini, and P. Galinari. Lshtc: A benchmark for large-scale text classification. arXiv preprint arXiv:1503.08581, 2015.
  • [20] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, April 2010.
  • [21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
  • [22] Sunita Sarawagi, Soumen Chakrabarti, and Shantanu Godbole. Cross-training: learning probabilistic mappings between topics. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 177–186. ACM, 2003.
  • [23] C. N. Silla Jr. and A. A. Freitas. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1-2):31–72, 2011.
  • [24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. CoRR, abs/1409.4842, 2014.
  • [25] B. Zadrozny and C. Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 694–699, 2002.