
Hierarchical Class-Based Curriculum Loss

by   Palash Goyal, et al.

Classification algorithms in machine learning often assume a flat label space. However, most real world data have dependencies between the labels, which can often be captured by using a hierarchy. Utilizing this relation can help develop a model capable of satisfying the dependencies and improving model accuracy and interpretability. Further, as different levels in the hierarchy correspond to different granularities, penalizing each label equally can be detrimental to model learning. In this paper, we propose a loss function, hierarchical curriculum loss, with two properties: (i) satisfy hierarchical constraints present in the label space, and (ii) provide non-uniform weights to labels based on their levels in the hierarchy, learned implicitly by the training paradigm. We theoretically show that the proposed loss function is a tighter bound of 0-1 loss compared to any other loss satisfying the hierarchical constraints. We test our loss function on real world image data sets, and show that it substantially outperforms multiple baselines.



1 Introduction

Machine learning (ML) models are trained on class labels that often have an underlying taxonomy or hierarchy defined over the label space. For example, a set of images may contain objects like “building” and “bulldog”. There exists a class-subclass relation between the “dog” and “bulldog” classes, so if the model predicts the object to be a dog instead of a bulldog, a human evaluator will consider the error to be mild. In comparison, if the model predicts the object to be “stone” or “car”, then the error will be more egregious. Although such nuances are often not visible through standard evaluation metrics, they are extremely important when deploying ML models in real world scenarios.

Hierarchical multi-label classification (HMC) methods, which utilize the hierarchy of class labels, aim to tackle the above issue. Traditional methods in this domain broadly use one of three approaches: (i) architectural modifications to the original model, to learn either levels or individual classes separately, (ii) converting the discrete label space to a continuous one by embedding the labels using the relations between them, and (iii) modifying the loss function by adding more weight to specific classes in the hierarchy. However, the methods in this domain are mostly empirical and the choice of modifications is often experimental. To overcome this issue, we aim to incorporate the class dependencies in the loss function in a systematic fashion. To this end, we propose a formulation to incorporate a hierarchical constraint in a base loss function and show that our proposed loss is a tight bound to the base loss.

Further, we note that humans typically do not learn all the categories of objects at the same time but learn them gradually, starting with simple high level categories. A similar setting was explored by Bengio et al., who introduced the concept of curriculum learning: feeding the model easier examples first to mimic the way humans learn. They show that learning simple examples first makes the model learn a smoother function. Lyu et al. extended this to define an example-based curriculum loss with theoretical bounds to 0-1 loss. We extend our hierarchically constrained loss function to incorporate a class-based curriculum learning paradigm, implicitly providing higher weights to simpler classes. With the hierarchical constraints, the model ensures that the classes higher in the hierarchy are selected to provide training examples until the model learns to identify them correctly, before moving on to classes deeper in the hierarchy (making the learning problem more difficult).

We theoretically show that our proposed loss function, hierarchical class-based curriculum loss, is a tight bound on 0-1 loss. We also show that any other loss function that satisfies hierarchical constraints on a given base loss gives a higher loss compared to our loss. We evaluate this result empirically on two image data sets, showing that our loss function provides a significant improvement on the hierarchical distance metric compared to the baselines. We also show that, unlike many other hierarchical multi-label classification methods, our method doesn’t decrease the performance on non-hierarchical metrics and in most cases gives significant improvement over the baselines.

Overall, we make the following contributions in this paper:

  1. We introduce a hierarchically constrained loss function to account for the hierarchical relationships between labels in a hierarchical taxonomy.

  2. We provide theoretical analysis proving that this formulation of adding constraints renders the hierarchical loss function tightly bound w.r.t. the 0-1 loss.

  3. We add a class-based curriculum loss formulation on the constrained loss function, based on the intuition that shallower classes in the hierarchy are easier to learn compared to deeper classes in the same taxonomy path.

  4. We show that our class-based hierarchical curriculum loss renders a tighter bound to 0-1 loss by smoothing the hierarchical loss.

  5. We experimentally show the superiority of the proposed loss function compared to other state-of-the-art loss functions (e.g., cross entropy loss), w.r.t. multiple metrics (e.g., hierarchical distance, Hit@1).

  6. We provide ablation studies on the constraints and curriculum loss to empirically show the interplay between them on multiple datasets.

The rest of the paper is organized as follows. In Section 2, we cover the related works in this domain. Then in Section 3, we provide a mechanism to incorporate hierarchical constraints and show that the proposed loss is tightly bounded to the base loss. We then extend it to incorporate the class-based curriculum loss and show a tighter bound to 0-1 loss. In Section 4, we show experimental evaluations and ablations on two real world image data sets. Finally, we conclude in Section 5 summarizing our findings and mentioning future work.

2 Related Work

Research in hierarchical classification falls into three categories: (i) changing the label space from discrete to continuous by embedding the relation between labels, (ii) making structural modifications to the base architecture as per the hierarchy, and (iii) adding hierarchical regularizers and other loss function modifications.

Label-embedding methods learn a function to map class labels to continuous vectors capable of reconstructing the relation between labels. One advantage of such methods is their generalizability to the type of relations between labels. They represent the relation between labels as a general graph and use an embedding approach Goyal and Ferrara (2018) to generate a continuous vector representation. Once the label space is continuous, they use a continuous label prediction model and predict the embedding. The disadvantage is typically the difficulty of mapping the prediction back to the discrete space and the noise introduced in this conversion. In the text domain, works typically use word2vec Mikolov et al. (2013) and GloVe Pennington et al. (2014) to map the words to vectors. Ghosh et al. (2016) use contextual knowledge to constrain LSTM-based text embeddings, and Miyazaki et al. (2019) use a combination of Bi-LSTMs and hierarchical attention to learn embeddings for labels. For a general domain, Kumar et al. (2018) use maximum-margin matrix factorization to get a piece-wise linear mapping of labels from discrete to continuous space. The DeViSE Frome et al. (2013) method maps classes to a unit hypersphere using analysis on Wikipedia text. TLSE Chen et al. (2019) uses a two-stage label embedding method with a neural factorization machine He and Chua (2017) to jointly project features and labels into a latent space. CLEMS Huang and Lin (2017) proposes a cost-sensitive label embedding using the classic multidimensional scaling approach for manifold learning. SoftLabels Bertinetto et al. (2019) assigns soft penalties for classes as we go farther from the ground truth label in the hierarchy.

Models which perform structural modifications use earlier layers in the network to predict higher level categories and later layers to predict lower level categories. HMCN Wehrmann et al. (2018) defines a neural network layer for each layer in the hierarchy and uses skip connections between input and subsequent layers to ensure each layer gets the prediction output of the previous layer and the input. They also propose a variant, HMCN-R, which uses a recurrent neural network (RNN) to share parameters, and show that the performance deteriorates only a little using an RNN. AWX Masera and Blanzieri (2018) also proposes a hierarchical output layer which can be plugged in at the end of any classifier to make it hierarchical. HMC-LMLP Cerri et al. (2014) proposes a local model in which a neural network is defined for each term in the hierarchy. Alsallakh Bilal et al. (2017) combine hierarchical modifications using a visual-analytics method with a hierarchical loss and show how capturing hierarchy can significantly improve AlexNet. Note that this class of methods, which make structural modifications, is often domain dependent, needing a lot of time to analyze the data and come up with the modifications. In comparison, our loss-based HCL method is easier to implement and domain agnostic.

Finally, models which modify the loss function to incorporate hierarchy assign a higher penalty to the prediction of labels which are more distant from the ground truth. AGGKNN-L and ADDKNN-L-CSL Verma et al. (2012) modify the metric space and use a lowest common ancestor (LCA) based penalty between the classes to train their model. Similarly, Deng et al. (2010) modify the loss function to introduce a hierarchical cost, giving a penalty based on the height of the LCA. CNN-HL-LI Wu et al. (2016) uses an empirically learned weighting parameter to control the contribution of fine-grained classes. HXE Bertinetto et al. (2019) uses a probabilistic approach to assign penalties for a given class given the parent class and provides an information theoretic interpretation for it. In another interesting work, Baker and Korhonen (2017) initialize the weights of the final layer in the neural network according to the relations between labels and show significant gains in performance. In this paper, we propose a novel hierarchical loss function. Neural network models have also used different kinds of loss functions for particular domain applications, e.g., focal loss for object detection Lin et al. (2018); Li et al. (2019); Zhang et al. (2020), and ranking loss to align preferences Chen et al. (2009); Selvaraju et al. (2019); Zhang et al. (2019).

The domain of curriculum learning was introduced by Bengio et al. (2009), based on the observation that humans learn much faster when presented information in a meaningful order, as opposed to the random order typically used for training machine learning models. They tested this idea on machine learning by providing the model with easy examples first and then subsequently providing harder examples. Several follow-up works have shown this type of learning to be successful. Cyclic learning rates Smith (2017) combine curriculum learning with simulated annealing to vary the learning rate within a range and achieve faster convergence. Smith and Topin (2019) further extend this to get super-convergence using large learning rates. Teacher-student curriculum learning Matiisen et al. (2019) uses a teacher to assist the student in learning the tasks at which the student makes the fastest progress. Curriculum loss Lyu and Tsang (2019) introduces a modified loss function to select easy examples automatically.

The above works on hierarchical multi-label classification are in general empirical and require careful study of the application domain. In this work, we propose a hierarchical class-based curriculum loss with theoretical analysis and provable bounds rendering it free of hyperparameters. Further, we propose a class-based curriculum loss to enhance the performance of hierarchically constrained loss.

3 Hierarchical Class-Based Curriculum Loss

For the task of multi-class classification, given a multi-class loss function, our goal is to incorporate the hierarchical constraints present in the label space into the given loss function. Note that we consider the general problem of multi-label multi-class classification, of which single-label multi-class classification is an instance, thus making our method applicable to a wide variety of problems. In this section, we first define the general formulation of multi-class multi-label classification. We then define a hierarchical constraint which we require to be satisfied for a hierarchical label space. We then introduce our formulation of a hierarchically constrained loss and show that the proposed loss function indeed satisfies the constraint. We then prove a bound on the proposed loss. We extend the loss function to implicitly use a curriculum learning paradigm and show a tight bound to 0-1 loss using this. Finally, we present our algorithm to train the model using our loss function.

3.1 Incorporating Hierarchical Constraints

Consider the learning framework with training set with training examples and input image features . We represent the labels as where is the number of classes and means that the example belongs to class. A 0-1 multi-class multi-label loss with this setting can be defined as follows:


Let the set of classes be arranged in a hierarchy defined by a hierarchy mapping function which maps a category to its children categories. We use the function to denote the mapping from a category to its level in the hierarchy. We now define the following hierarchical constraint on a generic loss function , the satisfaction of which would yield the loss function :


The constraint implies that the loss increases monotonically with the level of the hierarchy i.e. loss of higher (i.e. closer to the root) levels in the hierarchy is lesser than that of the lower levels (i.e. closer to the leaves). The intuition is that identifying categories in higher level is easier than categories in lower level as they are coarser. Its violation would mean that the model is able to differentiate between finer-grained classes more easily compared to coarse classes which is counterintuitive.
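To make the constraint concrete, here is a minimal Python sketch; it is an illustration, not the authors' code. It assumes labels are stored as 0/1 lists, that `level[j]` gives the depth of class `j` with 0 as the level closest to the root, and that the constraint compares every pair of classes lying at different levels. The function names are hypothetical.

```python
from typing import List, Sequence


def zero_one_losses(y_true: Sequence[int], y_pred: Sequence[int]) -> List[int]:
    """Per-class 0-1 loss for one example: 1 where the labels disagree."""
    return [int(t != p) for t, p in zip(y_true, y_pred)]


def satisfies_hierarchy(losses: Sequence[float], level: Sequence[int]) -> bool:
    """True if every class at a deeper level has a loss at least as large
    as every class at a shallower level (assumed reading of the constraint)."""
    max_shallower = float("-inf")  # largest loss seen at strictly shallower levels
    for lv in sorted(set(level)):
        here = [l for l, m in zip(losses, level) if m == lv]
        if min(here) < max_shallower:
            return False
        max_shallower = max(max_shallower, max(here))
    return True


# Toy hierarchy (assumed): animal (level 0), dog and cat (level 1), bulldog (level 2).
level = [0, 1, 1, 2]
y_true = [1, 1, 0, 1]

# Missing only the finest class keeps the losses monotone in the level.
mild = zero_one_losses(y_true, [1, 1, 0, 0])
# Missing the coarsest class while getting finer ones right violates the constraint.
odd = zero_one_losses(y_true, [0, 1, 0, 1])
print(mild, satisfies_hierarchy(mild, level))
print(odd, satisfies_hierarchy(odd, level))
```

The second prediction is the counterintuitive case described above: the model recognizes "bulldog" but not "animal".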

We now propose , a variant of the 0-1 loss , and show that it satisfies the hierarchical constraint :


We now show the following result for :

Theorem 1.

(Hierarchically Constrained 0-1 Loss) Given a 0-1 loss function , the loss function defined in Equation 3 satisfies the hierarchical constraint defined in Equation 2.


Let us assume that doesn’t follow hierarchical constraint i.e.


However, we have


which contradicts the assumption in Eq. 4. This concludes that follows hierarchical constraint .
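One construction that satisfies a constraint of this kind can be sketched in Python as follows. This is an illustrative assumption rather than a verbatim reproduction of Equation 3: each class loss is raised to the largest loss found at any strictly shallower level, with level 0 assumed to be closest to the root. The name `constrain` is hypothetical, and the same function works for 0-1 losses and for real-valued base losses.

```python
from typing import List, Sequence


def constrain(losses: Sequence[float], level: Sequence[int]) -> List[float]:
    """Raise each class loss to at least the largest loss at any shallower level."""
    constrained = []
    for l_j, m_j in zip(losses, level):
        shallower = [l_k for l_k, m_k in zip(losses, level) if m_k < m_j]
        constrained.append(max([l_j] + shallower))
    return constrained


level = [0, 1, 1, 2]  # toy hierarchy: one root-level class, two children, one grandchild

# A mistake at the root level propagates to every deeper class.
print(constrain([1, 0, 0, 0], level))  # [1, 1, 1, 1]
# A mistake at level 1 propagates only to the level below it.
print(constrain([0, 1, 0, 0], level))  # [0, 1, 0, 1]
```

By construction the output is element-wise at least as large as the input, and a deeper class never ends up with a smaller value than a shallower one.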

Now, we show that the hierarchically constrained loss function is tightly bounded to the base function:

Theorem 2.

(Bound on Constrained 0-1 Loss) For a 0-1 loss function , the loss function defined in Equation 3 is an element-wise tight bound on with constraint . Let denote elementwise inequality i.e. means . We then have:


Let us assume that s.t. i.e.


As is a 0-1 loss function, this implies the following:


Substituting definition of from equation 3, from the second condition we get


This leads to two cases:

Case 1: If , then as , we have which violates the condition.

Case 2: If , then s.t. . Since , we have for . However, which violates constraint.

Thus, we get . ∎

The above results can be generalized for any base loss function as follows:

Corollary 1.

(Bound on Constrained Generic Loss) For any loss function , the loss function defined below is an element-wise tight bound on with constraint i.e.


the following relation holds:


We have shown above that the hierarchy constrained loss function provides an element-wise tight bound on the base loss . We now extend this loss function to use a curriculum learning paradigm and show that the loss is a tighter bound to 0-1 loss compared to any other loss satisfying hierarchical constraints.

3.2 Hierarchical Curriculum Loss

As shown by Hu et al. (2016), 0-1 loss ensures that the empirical risk has a monotonic relation with adversarial empirical risk. However, it is non-differentiable and difficult to optimize. Following the groundwork by Lyu and Tsang (2019), who propose an example-based curriculum loss, we present a class-based curriculum loss for any given loss function following the hierarchical constraint in the following theorem. The theorem also proves that the function defined is a tighter bound to 0-1 loss compared to any loss function which satisfies the hierarchical constraint and is lower bounded by . Note that a general loss function is element-wise lower bounded by 0-1 loss i.e. .

Theorem 3.

(Hierarchical Class-Based Curriculum Loss) For a general hierarchy constrained loss function , we define the loss function as follows:


Then, i.e. the following holds




For the lower bound we have,


We thus have . Subtracting from both sides, we get the theorem. ∎
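The sandwich property stated in the theorem can be checked numerically on a toy case. The sketch below is a hedged illustration: it assumes the class-based objective mirrors the example-based curriculum loss of Lyu and Tsang, i.e. a minimum over binary selection vectors s of max(sum_j s_j * l_h[j], C - sum_j s_j + sum_j e_h[j]), and it assumes l_h[j] >= e_h[j] for every class. The function name and the numbers are made up for illustration.

```python
from itertools import product
from typing import Sequence


def curriculum_loss(l_h: Sequence[float], e_h: Sequence[int]) -> float:
    """Brute-force the assumed min-max objective over all s in {0,1}^C."""
    num_classes = len(l_h)
    total_e = sum(e_h)
    best = float("inf")
    for s in product((0, 1), repeat=num_classes):
        selected = sum(s_j * l_j for s_j, l_j in zip(s, l_h))  # loss of selected classes
        penalty = num_classes - sum(s) + total_e               # cost of skipping classes
        best = min(best, max(selected, penalty))
    return best


l_h = [0.2, 0.4, 1.5, 2.0]  # per-class constrained surrogate losses (toy values)
e_h = [0, 0, 1, 1]          # per-class constrained 0-1 losses (toy values)

value = curriculum_loss(l_h, e_h)
print(sum(e_h), value, sum(l_h))
assert sum(e_h) <= value <= sum(l_h)
```

Choosing s to be all ones recovers the surrogate loss, which gives the upper bound; the lower bound follows because the second term is never smaller than the total 0-1 loss. Brute force costs 2^C evaluations and is only meant for checking small cases.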

3.3 Algorithm

Function selectClasses (Training Data , Base Loss , Threshold )
       for  do
             for  do
                   += ;
      Get s.t. ;
       for  do
             if   then
Algorithm 1 Class Selection for Hierarchical Class-Based Curriculum Learning

In the above theorem, we prove that the proposed hierarchical class-based curriculum loss provides a tighter bound to 0-1 loss compared to the hierarchically constrained loss function. Given the above, we now need to find the optimal class selection parameters for each class. We show that Algorithm 1 provides the optimal selection parameters:

Theorem 4.

(Class Selection) Given a base loss function , a hierarchically constrained loss function , a solution for Equation 12 is provided by Algorithm 1.

The proof for above is provided in the Appendix. Note that the time complexity of above algorithm is and is thus computationally inexpensive as the number of classes doesn’t typically go over orders of thousands.
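A sort-and-scan procedure is one way to pick the selection vector. The sketch below is not claimed to reproduce Algorithm 1 line by line. It assumes the objective has the min-max form max(sum_j s_j * l_h[j], C - sum_j s_j + sum_j e_h[j]) adapted from Lyu and Tsang, that per-class losses have already been accumulated over the training examples, and that all losses are non-negative. Under those assumptions, for a fixed number k of selected classes the best choice is the k smallest losses, so scanning k over the sorted order finds the minimum in O(C log C) time.

```python
from typing import List, Sequence, Tuple


def select_classes(l_h: Sequence[float], e_h: Sequence[int]) -> Tuple[List[int], float]:
    """Return (s, value): a 0/1 selection vector and the objective it attains."""
    num_classes = len(l_h)
    total_e = sum(e_h)
    order = sorted(range(num_classes), key=lambda j: l_h[j])  # easiest classes first

    best_value, best_k = num_classes + total_e, 0  # k = 0 selects nothing
    prefix = 0.0
    for k, j in enumerate(order, start=1):
        prefix += l_h[j]                                   # loss of the k easiest classes
        value = max(prefix, num_classes - k + total_e)     # assumed objective at this k
        if value < best_value:
            best_value, best_k = value, k

    s = [0] * num_classes
    for j in order[:best_k]:
        s[j] = 1
    return s, best_value


s, value = select_classes([2.0, 0.2, 1.5, 0.4], [1, 0, 1, 0])
print(s, value)  # the class with the largest loss is left out
```

In this toy call the three classes with the smallest losses are selected, which matches the intuition that simpler classes receive more weight early on.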

4 Experiments

We first perform an ablation study on each component of the hierarchical class-based curriculum loss, namely the hierarchically constrained loss and the class-based curriculum loss, and show how they interplay to provide the final loss function. We then compare our loss function against state-of-the-art losses to show its performance gain.

4.1 Experimental Setup

We evaluate our loss function on two real world image data sets: Diatoms Dimitrovski et al. (2012) and IMCLEF Dimitrovski et al. (2011). The Diatoms data set contains 3,119 images of diatoms (a large and ecologically important group of unicellular or colonial organisms (algae)). Each diatom can correspond to one or many of the categories arranged in a hierarchy. Overall, there are 399 categories in this data set, arranged into a hierarchy of height 4. On the other hand, the IMCLEF data set contains x-ray images of human bodies, and its 47 classes correspond to body parts arranged in a hierarchy of height 4. For both these data sets, we use a pre-extracted feature set, obtained using image segmentation techniques including Fourier transforms and histograms of local SIFT descriptors.

For evaluation, we use a multi-layer perceptron with the above extracted features as input and the categories as output. We select the hyperparameters of the neural network using evaluation on a validation set with binary cross entropy loss. Based on this, we get a structure with 800 hidden neurons and a dropout of 0.25. Note that we fix this network for all the baseline loss functions and our loss function to ensure fair comparison of results. We compare the hierarchical class-based curriculum loss with the following state-of-the-art losses: (i) binary cross entropy loss Goodfellow et al. (2016), (ii) focal loss Lin et al. (2017), and (iii) hierarchical cross entropy loss Bertinetto et al. (2019). Further, we also compare it with SoftLabels Bertinetto et al. (2019), which modifies the ground truth labels in accordance with the hierarchy.
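For orientation, the two non-hierarchical baselines can be written per label as in the sketch below. It assumes labels in {0, 1} and predicted probabilities strictly between 0 and 1, and it omits the optional class-balancing weight of focal loss. The value gamma = 2.0 is only a commonly used default, not a setting reported here.

```python
import math


def binary_cross_entropy(p: float, y: int) -> float:
    """Cross entropy for one label; p is the predicted probability that y == 1."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p: float, y: int, gamma: float = 2.0) -> float:
    """Focal loss: cross entropy scaled down by (1 - p_t) ** gamma,
    which reduces the contribution of labels that are already well classified."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# A confident correct prediction is damped much more than an uncertain one.
for p in (0.9, 0.6):
    print(p, binary_cross_entropy(p, 1), focal_loss(p, 1))
```

With gamma set to 0 the focal loss reduces to binary cross entropy, so the two baselines differ only in how strongly easy labels are down-weighted.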

We use the following metrics to evaluate each of the losses for the classification task: (i) Hit@1, (ii) MRR (Mean Reciprocal Rank) Radev et al. (2002), and (iii) HierDist Deng et al. (2010). The first two metrics capture the accuracy of the ranking of the model predictions; hierarchy-capturing methods often show lower performance on them compared to non-hierarchical methods, as the losses get more constrained. Hierarchical methods often show improvements on a metric which captures how close the prediction is to the ground truth class in the given hierarchy.
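The two ranking metrics can be sketched as follows for the multi-label case. The conventions are assumptions: `scores` holds one model score per class, `relevant` is the set of ground-truth class indices, a hit means the top-scored class is relevant, and the reciprocal rank uses the first relevant class in the ranking. All names are hypothetical.

```python
from typing import Callable, List, Sequence, Set


def hit_at_1(scores: Sequence[float], relevant: Set[int]) -> float:
    """1.0 if the highest-scored class is one of the ground-truth classes."""
    top = max(range(len(scores)), key=lambda j: scores[j])
    return 1.0 if top in relevant else 0.0


def reciprocal_rank(scores: Sequence[float], relevant: Set[int]) -> float:
    """1 / rank of the first ground-truth class in the descending ranking."""
    ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
    for rank, j in enumerate(ranking, start=1):
        if j in relevant:
            return 1.0 / rank
    return 0.0


def mean_metric(metric: Callable[[Sequence[float], Set[int]], float],
                all_scores: List[Sequence[float]],
                all_relevant: List[Set[int]]) -> float:
    """Average a per-example metric over a data set."""
    values = [metric(s, r) for s, r in zip(all_scores, all_relevant)]
    return sum(values) / len(values)


all_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
all_relevant = [{1}, {2}]
print(mean_metric(hit_at_1, all_scores, all_relevant))         # 0.5
print(mean_metric(reciprocal_rank, all_scores, all_relevant))  # (1 + 1/3) / 2
```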

Our final metric, HierDist, captures this and can be defined as the minimum height of the lowest common ancestor (LCA) between the ground truth labels and the top prediction from the model. Mathematically, for a data point , it can be defined as follows:

where denotes the hierarchy of the labels. Note that, as pointed out by Deng et al. (2010), the metric is effectively on a log scale. It is measured by the height in the hierarchy of the lowest common ancestor, and moving up a level can more than double the number of descendants depending on the fan-out of the parent class (often greater than 3-4). We show that our loss function is superior to the baseline losses for this metric. In addition, our model’s performance does not deteriorate on non-hierarchical metrics.
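A minimal sketch of HierDist on a toy tree is given below. It assumes the hierarchy is a tree stored as a child-to-parent map, and that the height of a node is the length of the longest downward path from it to a leaf, so a leaf has height 0. The class names and the tree are invented for illustration.

```python
from typing import Dict, Iterable, List


def build_children(parent: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert a child -> parent map into a parent -> children map."""
    children: Dict[str, List[str]] = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    return children


def height(node: str, children: Dict[str, List[str]]) -> int:
    """Longest downward path from node to a leaf (a leaf has height 0)."""
    kids = children.get(node, [])
    return 0 if not kids else 1 + max(height(k, children) for k in kids)


def ancestors(node: str, parent: Dict[str, str]) -> List[str]:
    """The node itself followed by its ancestors up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(a: str, b: str, parent: Dict[str, str]) -> str:
    """Lowest common ancestor of a and b in the tree."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes do not share a root")


def hier_dist(true_labels: Iterable[str], top_prediction: str,
              parent: Dict[str, str]) -> int:
    """Minimum LCA height between any ground-truth label and the top prediction."""
    children = build_children(parent)
    return min(height(lca(t, top_prediction, parent), children) for t in true_labels)


parent = {"animal": "root", "vehicle": "root", "dog": "animal", "cat": "animal",
          "bulldog": "dog", "poodle": "dog", "car": "vehicle"}
for guess in ("bulldog", "poodle", "cat", "car"):
    print(guess, hier_dist({"bulldog"}, guess, parent))
```

On this tree the four guesses score 0, 1, 2, and 3 respectively: confusing a bulldog with a poodle is a milder error than confusing it with a car.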

4.2 Ablation Studies

                Diatoms                     IMCLEF
Methods         Hit@1   MRR     HierDist    Hit@1   MRR     HierDist
CrossEntropy    74.45   81.00   1.26        90.45   93.33   0.35
HCL-Hier        75.12   81.46   1.23        90.45   93.33   0.33
HCL-CL          74.93   81.34   1.24        90.65   93.34   0.32
HCL             75.21   81.56   1.22        90.95   93.49   0.22
Table 1: Ablation studies showing the effect of each component of our loss function.

We show the effects of the hierarchical constraints and the curriculum loss, using cross entropy loss as the base function, in Table 1. The reported results are evaluated on held-out test data.

Consider the hierarchically constrained cross entropy loss, HCL-Hier. For Diatoms, we observe that the loss function improves the hierarchical metric, decreasing the hierarchical distance from 1.26 to 1.23. Further, we see that the non-hierarchical metrics also improve compared to the baseline loss. For IMCLEF, we observe that the non-hierarchical metrics stay the same as the cross entropy baseline but the loss improves the hierarchical distance metric. This shows that our formulation of adding the hierarchical constraint does not deteriorate typical metrics, while making the predictions more consistent with the label space hierarchy and closer to the ground truth label in the hierarchical space. We believe that the improvement in non-hierarchical metrics may be attributed to the modified ranking due to the hierarchical constraint making the top predictions more likely.

Looking into the class-based curriculum loss individually, we observe that, as suggested by theory, carefully selecting the classes based on their loss (implicitly giving more weight to simpler classes) improves the hierarchical evaluation metric, making the predictions more relevant. Further, as this way of learning also makes the learned model function smoother, we observe that the non-hierarchical metrics also improve with respect to the baseline. The trend is similar in both data sets: all the metrics improve.

Overall, combining the two aspects yields HCL, which shows the best performance. We see that for all the metrics, HCL gives gains both with respect to the baseline loss and the individual components. This follows the theory, in which we have shown that combining class-based curriculum loss with hierarchical constraints gives a tighter bound to 0-1 loss with respect to the hierarchically constrained loss. Further, as this loss explicitly ensures that the loss of a higher level node is lower than that of a lower level node, and implicitly gives more weight to higher level nodes as they are selected more, the combined effect makes it more consistent with the hierarchy. This can be particularly seen with the IMCLEF data set, for which each individual component gave good improvements for the hierarchical metric but the combined loss gave a much larger gain. We also note that Diatoms has 399 classes compared to the 47 classes of IMCLEF with a similar number of levels, making the number of categories in each layer around 8 times higher and the gains in HierDist more difficult to attain.

4.3 Results

                Diatoms                     IMCLEF
Methods         Hit@1   MRR     HierDist    Hit@1   MRR     HierDist
CrossEntropy    74.45   81.00   1.26        90.45   93.33   0.35
FocalLoss       74.36   80.47   1.27        90.85   93.60   0.26
Hier-CE         75.12   81.51   1.24        90.85   93.63   0.24
SoftLabels      72.36   74.95   1.38        90.45   92.07   0.33
HCL             75.21   81.56   1.22        90.95   93.49   0.22
Table 2: Hierarchical Image Classification Results on Diatoms and IMCLEF data sets. We use pre-extracted features with a multi-layer perceptron as our base model.

We now compare our proposed loss (HCL) with the state-of-the-art loss functions capturing hierarchy as well as a label embedding method (SoftLabels). From Table 2, we observe that our loss outperforms the base loss functions. Consistent with the results presented by Bertinetto et al. (2019), we see that previously proposed hierarchically constrained loss functions, especially SoftLabels, improve on hierarchical metrics but their performance deteriorates for non-hierarchical metrics. On the other hand, HCL’s performance improves or stays comparable to the baselines on non-hierarchical metrics, while getting gains on the hierarchical metric over state-of-the-art loss functions. We observe that Hier-CE is the best performing baseline, but our model outperforms it on every metric except MRR on IMCLEF. As pointed out above, the gains in Diatoms are more difficult to obtain given the number of classes in each level, but our model gives visible gains on it as well.

5 Conclusion

In this paper, we proposed a novel loss function for the multi-label multi-class classification task, ensuring the predictions are consistent with the hierarchical constraints present in the label space. For the hierarchical loss, we proposed a formulation to modify any loss function to incorporate hierarchical constraints and showed that this loss function is a tight bound on the base loss compared to any other loss satisfying hierarchical constraints. We next proposed a class-based curriculum loss, which implicitly gives more weight to simpler classes, rendering the model smoother and more consistent with the hierarchy, as shown experimentally. We combined the two components and theoretically showed that the combination provides a tighter bound to 0-1 loss, rendering it more robust and accurate. We validated our theory with experiments on two multi-label multi-class hierarchical image data sets of human x-rays and diatoms. We showed that the results are consistent with the theory and show gains on the hierarchical metric for the data sets. Further, we observed that our models also improve on non-hierarchical metrics, making the loss function more widely applicable. In the future, we would like to relax the hierarchical constraints and develop a loss for a general graph-based relation structure between the labels. Further, we would like to test the model on other real world data sets which contain relations between labels. Finally, we would also like to test the model performance when we introduce noise into the hierarchy and the labels.


  • [1] S. Baker and A. Korhonen (2017) Initializing neural networks for hierarchical multi-label text classification. In BioNLP 2017, Cited by: §2.
  • [2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §2.
  • [3] L. Bertinetto, R. Mueller, K. Tertikas, S. Samangooei, and N. A. Lord (2019) Making better mistakes: leveraging class hierarchies with deep networks. arXiv preprint arXiv:1912.09393. Cited by: §2, §2, §4.1, §4.3.
  • [4] A. Bilal, A. Jourabloo, M. Ye, X. Liu, and L. Ren (2017)

    Do convolutional neural networks learn class hierarchy?

    IEEE transactions on visualization and computer graphics 24 (1), pp. 152–162. Cited by: §2.
  • [5] R. Cerri, R. C. Barros, and A. C. De Carvalho (2014) Hierarchical multi-label classification using local neural networks. Journal of Computer and System Sciences 80 (1), pp. 39–56. Cited by: §2.
  • [6] C. Chen, H. Wang, W. Liu, X. Zhao, T. Hu, and G. Chen (2019) Two-stage label embedding via neural factorization machine for multi-label classification. In

    Proceedings of the AAAI Conference on Artificial Intelligence

    Vol. 33, pp. 3304–3311. Cited by: §2.
  • [7] W. Chen, T. Liu, Y. Lan, Z. Ma, and H. Li (2009) Ranking measures and loss functions in learning to rank. In Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems., Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta (Eds.), pp. 315–323. Cited by: §2.
  • [8] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei (2010) What does classifying more than 10,000 image categories tell us?. In

    European conference on computer vision

    pp. 71–84. Cited by: §2, §4.1, §4.1.
  • [9] I. Dimitrovski, D. Kocev, S. Loskovska, and S. Džeroski (2011) Hierarchical annotation of medical images. Pattern Recognition 44 (10-11), pp. 2436–2449. Cited by: §4.1.
  • [10] I. Dimitrovski, D. Kocev, S. Loskovska, and S. Džeroski (2012) Hierarchical classification of diatom images using ensembles of predictive clustering trees. Ecological Informatics 7 (1), pp. 19–29. Cited by: §4.1.
  • [11] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) Devise: a deep visual-semantic embedding model. In Advances in neural information processing systems, Cited by: §2.
  • [12] S. Ghosh, O. Vinyals, B. Strope, S. Roy, T. Dean, and L. P. Heck (2016) Contextual LSTM (CLSTM) models for large scale NLP tasks. In

    KDD Workshop on Large-scale Deep Learning for Data Mining (DL-KDD)

    Cited by: §2.
  • [13] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Note: Cited by: §4.1.
  • [14] P. Goyal and E. Ferrara (2018) Graph embedding techniques, applications, and performance: a survey. Knowledge-Based Systems 151, pp. 78 – 94. External Links: ISSN 0950-7051, Document, Link Cited by: §2.
  • [15] X. He and T. Chua (2017) Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 355–364. Cited by: §2.
  • [16] W. Hu, G. Niu, I. Sato, and M. Sugiyama (2016) Does distributionally robust supervised learning give robust classifiers? arXiv preprint arXiv:1611.02041. Cited by: §3.2.
  • [17] K. Huang and H. Lin (2017) Cost-sensitive label embedding for multi-label classification. Machine Learning 106 (9-10), pp. 1725–1746. Cited by: §2.
  • [18] V. Kumar, A. K. Pujari, V. Padmanabhan, S. K. Sahu, and V. R. Kagita (2018) Multi-label classification using hierarchical embedding. Expert Systems with Applications 91, pp. 263–269. Cited by: §2.
  • [19] D. Li, S. Tasci, S. Ghosh, J. Zhu, J. Zhang, and L. P. Heck (2019) Efficient incremental learning for mobile object detection. In Proceedings of the ACM/IEEE Symposium on Edge Computing (SEC), Cited by: §2.
  • [20] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §4.1.
  • [21] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2018) Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
  • [22] Y. Lyu and I. W. Tsang (2019) Curriculum loss: robust learning and generalization against label corruption. arXiv preprint arXiv:1905.10045. Cited by: §2, §3.2.
  • [23] L. Masera and E. Blanzieri (2018) Awx: an integrated approach to hierarchical-multilabel classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 322–336. Cited by: §2.
  • [24] T. Matiisen, A. Oliver, T. Cohen, and J. Schulman (2019) Teacher-student curriculum learning. IEEE transactions on neural networks and learning systems. Cited by: §2.
  • [25] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §2.
  • [26] T. Miyazaki, K. Makino, Y. Takei, H. Okamoto, and J. Goto (2019) Label embedding using hierarchical structure of labels for twitter classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6318–6323. Cited by: §2.
  • [27] J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In EMNLP, Vol. 14, pp. 1532–43. Cited by: §2.
  • [28] D. R. Radev, H. Qi, H. Wu, and W. Fan (2002) Evaluating web-based question answering systems. In LREC, Cited by: §4.1.
  • [29] R. R. Selvaraju, S. Lee, Y. Shen, H. Jin, S. Ghosh, L. Heck, D. Batra, and D. Parikh (2019) Taking a hint: leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2591–2600. Cited by: §2.
  • [30] L. N. Smith and N. Topin (2019) Super-convergence: very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Vol. 11006, pp. 1100612. Cited by: §2.
  • [31] L. N. Smith (2017) Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472. Cited by: §2.
  • [32] N. Verma, D. Mahajan, S. Sellamanickam, and V. Nair (2012) Learning hierarchical similarity metrics. In 2012 IEEE conference on computer vision and pattern recognition, pp. 2280–2287. Cited by: §2.
  • [33] J. Wehrmann, R. Cerri, and R. Barros (2018) Hierarchical multi-label classification networks. In International Conference on Machine Learning, pp. 5075–5084. Cited by: §2.
  • [34] H. Wu, M. Merler, R. Uceda-Sosa, and J. R. Smith (2016) Learning to make better mistakes: semantics-aware visual food recognition. In Proceedings of the 24th ACM international conference on Multimedia, pp. 172–176. Cited by: §2.
  • [35] H. Zhang, S. Ghosh, L. Heck, S. Walsh, J. Zhang, J. Zhang, and C. Kuo (2019) Generative visual dialogue system via adaptive reasoning and weighted likelihood estimation. In Proceedings of IJCAI, Cited by: §2.
  • [36] J. Zhang, J. Zhang, S. Ghosh, D. Li, S. Tasci, L. P. Heck, H. Zhang, and C.-C. J. Kuo (2020) Class-incremental learning via deep model consolidation. In IEEE Winter Conference on Applications of Computer Vision, WACV, pp. 1120–1129. Cited by: §2.