1 Introduction
Machine learning (ML) models are trained on class labels that often have an underlying taxonomy or hierarchy defined over the label space. For example, a set of images may contain objects like "building" and "bulldog". There exists a class-subclass relation between the "dog" and "bulldog" classes, so if the model predicts the object to be a dog instead of a bulldog, a human evaluator will consider the error to be mild. In comparison, if the model predicts the object to be "stone" or "car", the error is more egregious. Although such nuances are often not visible through standard evaluation metrics, they are extremely important when deploying ML models in real-world scenarios.
Hierarchical multilabel classification (HMC) methods, which utilize the hierarchy of class labels, aim to tackle the above issue. Traditional methods in this domain broadly use one of three approaches: (i) architectural modifications to the original model, to learn either levels or individual classes separately, (ii) converting the discrete label space to a continuous one by embedding the labels using the relations between them, and (iii) modifying the loss function by adding more weight to specific classes in the hierarchy. However, the methods in this domain are mostly empirical and the choice of modifications is often experimental. To overcome this issue, we aim to incorporate the class dependencies in the loss function in a systematic fashion. To this end, we propose a formulation to incorporate hierarchical constraints in a base loss function and show that our proposed loss is a tight bound on the base loss.
Further, we note that humans typically do not learn all the categories of objects at the same time but learn them gradually, starting with simple high-level categories. A similar setting was explored by Bengio et al., who introduced the concept of curriculum learning: feeding the model easier examples first to mimic the way humans learn. They show that learning simple examples first makes the model learn a smoother function. Lyu et al. extended this to define an example-based curriculum loss with theoretical bounds to 0-1 loss. We extend our hierarchically constrained loss function to incorporate a class-based curriculum learning paradigm, implicitly providing higher weights to simpler classes. With the hierarchical constraints, the model ensures that the classes higher in the hierarchy are selected to provide training examples until the model learns to identify them correctly, before moving on to classes deeper in the hierarchy (making the learning problem more difficult).
We theoretically show that our proposed loss function, hierarchical class-based curriculum loss, is a tight bound on 0-1 loss. We also show that any other loss function that satisfies hierarchical constraints on a given base loss gives a higher loss compared to our loss. We evaluate this result empirically on two image data sets, showing that our loss function provides a significant improvement on the hierarchical distance metric compared to the baselines. We also show that, unlike many other hierarchical multilabel classification methods, our method doesn’t decrease the performance on non-hierarchical metrics and in most cases gives significant improvement over the baselines.
Overall, we make the following contributions in this paper:

We introduce a hierarchically constrained loss function to account for the hierarchical relationships between labels in a taxonomy.

We provide a theoretical analysis proving that this formulation of adding constraints renders the hierarchical loss function tightly bound w.r.t. the 0-1 loss.

We add a class-based curriculum loss formulation on the constrained loss function, based on the intuition that shallower classes in the hierarchy are easier to learn compared to deeper classes in the same taxonomy path.

We show that our class-based hierarchical curriculum loss renders a tighter bound to 0-1 loss by smoothing the hierarchical loss.

We experimentally show the superiority of the proposed loss function compared to other state-of-the-art loss functions (e.g., cross entropy loss), w.r.t. multiple metrics (e.g., hierarchical distance, Hit@1).

We provide ablation studies on the constraints and curriculum loss to empirically show the interplay between them on multiple datasets.
The rest of the paper is organized as follows. In Section 2, we cover the related work in this domain. Then in Section 3, we provide a mechanism to incorporate hierarchical constraints and show that the proposed loss is tightly bounded to the base loss. We then extend it to incorporate the class-based curriculum loss and show a tighter bound to 0-1 loss. In Section 4, we show experimental evaluations and ablations on two real-world image data sets. Finally, we conclude in Section 5 by summarizing our findings and mentioning future work.
2 Related Work
Research in hierarchical classification falls into three categories: (i) changing the label space from discrete to continuous by embedding the relation between labels, (ii) making structural modifications to the base architecture as per the hierarchy, and (iii) adding hierarchical regularizers and other loss function modifications.
Label-embedding methods learn a function to map class labels to continuous vectors capable of reconstructing the relations between labels. One advantage of such methods is their generalizability to the type of relations between labels. They represent the relations between labels as a general graph and use an embedding approach Goyal and Ferrara (2018) to generate a continuous vector representation. Once the label space is continuous, they use a continuous label prediction model and predict the embedding. The disadvantage is typically the difficulty of mapping the prediction back to the discrete space and the noise introduced in this conversion. In the text domain, works typically use word2vec Mikolov et al. (2013) and GloVe Pennington et al. (2014) to map words to vectors. Ghosh et al. (2016) use contextual knowledge to constrain LSTM-based text embeddings, and Miyazaki et al. (2019) use a combination of BiLSTMs and hierarchical attention to learn embeddings for labels. For a general domain, Kumar et al. (2018) use maximum-margin matrix factorization to get a piecewise linear mapping of labels from discrete to continuous space. The DeViSE Frome et al. (2013) method maps classes to a unit hypersphere using analysis of Wikipedia text. TLSE Chen et al. (2019) is a two-stage label embedding method that uses a neural factorization machine He and Chua (2017) to jointly project features and labels into a latent space. CLEMS Huang and Lin (2017) is a cost-sensitive label embedding that uses the classic multidimensional scaling approach for manifold learning. SoftLabels Bertinetto et al. (2019) assigns soft penalties to classes as we go farther from the ground truth label in the hierarchy.
Models which perform structural modifications use earlier layers in the network to predict higher level categories and later layers to predict lower level categories. HMCN Wehrmann et al. (2018) defines a neural network layer for each layer in the hierarchy and uses skip connections between the input and subsequent layers to ensure each layer receives both the prediction output of the previous layer and the input. They also propose a variant, HMCN-R, which uses a recurrent neural network (RNN) to share parameters, and show that the performance deteriorates only a little with an RNN. AWX Masera and Blanzieri (2018) also proposes a hierarchical output layer which can be plugged in at the end of any classifier to make it hierarchical. HMC-LMLP Cerri et al. (2014) is a local model which defines a neural network for each term in the hierarchy. Bilal et al. (2017) combine hierarchical modifications, obtained using a visual-analytics method, with a hierarchical loss and show how capturing hierarchy can significantly improve AlexNet. Note that this class of methods, which make structural modifications, is often domain dependent and often needs a lot of time spent analyzing the data to come up with the modifications. In comparison, our loss-based HCL method is easier to implement and domain agnostic.
Finally, models which modify the loss function to incorporate hierarchy assign a higher penalty to the prediction of labels which are more distant from the ground truth. AGGKNNL and ADDKNNLCSL Verma et al. (2012) modify the metric space and use a lowest common ancestor (LCA) based penalty between the classes to train their model. Similarly, Deng et al. (2010) modify the loss function to introduce a hierarchical cost, giving a penalty based on the height of the LCA. CNNHLLI Wu et al. (2016) uses an empirically learned weighting parameter to control the contribution of fine-grained classes. HXE Bertinetto et al. (2019) uses a probabilistic approach to assign penalties for a given class given the parent class and provides an information-theoretic interpretation for it. In another interesting work, Baker and Korhonen (2017) initialize the weights of the final layer in the neural network according to the relations between labels and show significant gains in performance. In this paper, we propose a novel hierarchical loss function. Neural network models have also used different kinds of loss functions for particular domain applications, e.g., focal loss for object detection Lin et al. (2018); Li et al. (2019); Zhang et al. (2020), and ranking loss to align preferences Chen et al. (2009); Selvaraju et al. (2019); Zhang et al. (2019).
The domain of curriculum learning was introduced by Bengio et al. (2009), based on the observation that humans learn much faster when presented with information in a meaningful order, as opposed to the random order typically used for training machine learning models. They tested this idea on machine learning by providing the model with easy examples first and then subsequently providing harder examples. Several follow-up works have shown this type of learning to be successful. Cyclic learning rates Smith (2017) combine curriculum learning with simulated annealing to vary the learning rate within a range and achieve faster convergence. Smith and Topin (2019) further extend this to get super-convergence using large learning rates. Teacher-student curriculum learning Matiisen et al. (2019) uses a teacher to assist the student in learning the tasks at which the student makes the fastest progress. Curriculum loss Lyu and Tsang (2019) introduces a modified loss function to select easy examples automatically.
The above works on hierarchical multilabel classification are in general empirical and require careful study of the application domain. In this work, we propose a hierarchical class-based curriculum loss with theoretical analysis and provable bounds, rendering it free of hyperparameters. Further, we propose a class-based curriculum loss to enhance the performance of the hierarchically constrained loss.
3 Hierarchical Class-Based Curriculum Loss
For the task of multiclass classification, given a multiclass loss function, our goal is to incorporate the hierarchical constraints present in the label space into the given loss function. Note that we consider the general problem of multilabel multiclass classification, of which single-label multiclass classification is an instance, making our method applicable to a wide variety of problems. In this section, we first define the general formulation of multiclass multilabel classification. We then define a hierarchical constraint which we require to be satisfied for a hierarchical label space. We then introduce our formulation of a hierarchically constrained loss and show that the proposed loss function indeed satisfies the constraint. We then prove a bound on the proposed loss. We extend the loss function to implicitly use a curriculum learning paradigm and show a tight bound to 0-1 loss using this. Finally, we present our algorithm to train the model using our loss function.
3.1 Incorporating Hierarchical Constraints
Consider the learning framework with training set with training examples and input image features . We represent the labels as where is the number of classes and means that the example belongs to class. A 0-1 multiclass multilabel loss with this setting can be defined as follows:
(1) 
Let the set of classes be arranged in a hierarchy defined by a hierarchy mapping function which maps a category to its children categories. We use the function to denote the mapping from a category to its level in the hierarchy. We now define the following hierarchical constraint on a generic loss function , the satisfaction of which would yield the loss function :
(2) 
The constraint implies that the loss increases monotonically with the level of the hierarchy, i.e., the loss at higher levels (closer to the root) is lower than that at lower levels (closer to the leaves). The intuition is that identifying categories at a higher level is easier than identifying categories at a lower level, as they are coarser. A violation would mean that the model differentiates between finer-grained classes more easily than between coarse classes, which is counterintuitive.
We now propose , a variant of the 0-1 loss , and show that it satisfies the hierarchical constraint :
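One way to make the constraint concrete is as a check on the per-class losses of a single example. The following is a minimal sketch, assuming the hierarchy is given as a parent-to-children dictionary (`children`, a hypothetical stand-in for the hierarchy mapping function) and that each class is compared only with its own descendants. The function name and the toy hierarchy are illustrative and not taken from the paper, and the exact quantifiers of Equation 2 may differ from this reading.

```python
def constraint_violations(loss, children):
    """Return (ancestor, descendant) pairs whose losses break monotonicity.

    loss: dict mapping class name -> non-negative loss for one example.
    children: dict mapping class name -> list of child class names
              (hypothetical stand-in for the hierarchy mapping function).
    """
    violations = []

    def visit(ancestor, node):
        for child in children.get(node, []):
            # A coarser class must not incur a larger loss than a finer one.
            if loss[ancestor] > loss[child]:
                violations.append((ancestor, child))
            visit(ancestor, child)

    for cls in loss:
        visit(cls, cls)
    return violations


# Toy hierarchy (illustrative only): animal -> dog -> bulldog.
children = {"animal": ["dog"], "dog": ["bulldog"]}
ok = {"animal": 0.1, "dog": 0.4, "bulldog": 0.9}
bad = {"animal": 0.7, "dog": 0.4, "bulldog": 0.9}
print(constraint_violations(ok, children))   # []
print(constraint_violations(bad, children))  # [('animal', 'dog')]
```

In the second example the coarse class "animal" has a larger loss than its descendant "dog", which is exactly the counterintuitive situation the constraint rules out.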
(3) 
We now show the following result for :
Theorem 1.
Proof.
Let us assume that doesn’t follow hierarchical constraint i.e.
(4) 
However, we have
(5) 
which contradicts the assumption in Eq. 4. This concludes that follows hierarchical constraint . ∎
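To illustrate how a base loss can be made to satisfy the constraint, the sketch below lifts each class loss to the maximum base loss found on the path from that class to the root. This max-over-ancestors form is an assumption made for illustration, since Equation 3 is not reproduced in this text; the `parent` dictionary and the toy classes are likewise hypothetical.

```python
def constrained_loss(base_loss, parent):
    """Lift per-class losses so that no class has a lower loss than an ancestor.

    base_loss: dict class -> loss under the base loss function.
    parent: dict class -> parent class (root classes are absent from the dict).
    Assumption: the constrained loss of a class is the maximum base loss
    along the path from the class up to the root.
    """
    lifted = {}
    for cls in base_loss:
        worst = base_loss[cls]
        node = parent.get(cls)
        while node is not None:
            worst = max(worst, base_loss[node])
            node = parent.get(node)
        lifted[cls] = worst
    return lifted


parent = {"dog": "animal", "bulldog": "dog", "cat": "animal"}
# 0-1 losses for one example: the coarse class "dog" is wrong, "bulldog" is right.
base = {"animal": 0, "dog": 1, "bulldog": 0, "cat": 0}
print(constrained_loss(base, parent))
# {'animal': 0, 'dog': 1, 'bulldog': 1, 'cat': 0}
```

Under this construction a mistake on a coarse class is inherited by all of its descendants, so the lifted loss can never decrease when moving down the hierarchy.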
Now, we show that the hierarchically constrained loss function is tightly bounded to the base function:
Theorem 2.
(Bound on Constrained 0-1 Loss) For a 0-1 loss function , the loss function defined in Equation 3 is an elementwise tight bound on with constraint . Let denote elementwise inequality i.e. means . We then have:
(6) 
Proof.
Let us assume that s.t. i.e.
(7) 
As is a 0-1 loss function, this implies the following:
(8) 
Substituting the definition of from Equation 3, from the second condition we get
(9) 
This leads to two cases:
Case 1: If , then as , we have which violates the condition.
Case 2: If , then s.t. . Since , we have for . However, which violates constraint.
Thus, we get . ∎
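The tightness claim can be sanity-checked by brute force on a small hierarchy. The sketch below assumes the same max-over-ancestors form of the constrained loss and a parent-child reading of the constraint (both are assumptions, since Equations 2 and 3 are not reproduced in this text). It enumerates every 0-1 loss vector that dominates the base loss and satisfies the constraint, and confirms that none of them is smaller than the lifted loss in any coordinate. The class names are illustrative.

```python
from itertools import product

PARENT = {"dog": "animal", "bulldog": "dog", "cat": "animal"}
CLASSES = ["animal", "dog", "bulldog", "cat"]


def lift(base):
    """Max of the base loss over a class and all of its ancestors."""
    out = {}
    for cls in CLASSES:
        worst, node = base[cls], PARENT.get(cls)
        while node is not None:
            worst, node = max(worst, base[node]), PARENT.get(node)
        out[cls] = worst
    return out


def satisfies_constraint(loss):
    """Every class has a loss at least as large as its parent's."""
    return all(loss[PARENT[c]] <= loss[c] for c in PARENT)


def is_tight(base):
    """Check that lift(base) is the elementwise-smallest feasible 0-1 loss."""
    lifted = lift(base)
    if not satisfies_constraint(lifted):
        return False
    for values in product([0, 1], repeat=len(CLASSES)):
        cand = dict(zip(CLASSES, values))
        dominates_base = all(cand[c] >= base[c] for c in CLASSES)
        if dominates_base and satisfies_constraint(cand):
            if any(cand[c] < lifted[c] for c in CLASSES):
                return False
    return True


all_tight = all(
    is_tight(dict(zip(CLASSES, values)))
    for values in product([0, 1], repeat=len(CLASSES))
)
print(all_tight)  # True
```

This is only a numerical check on one toy hierarchy, not a substitute for the proof, but it mirrors the two cases of the argument: a feasible loss must dominate the base loss at the class itself and at every ancestor.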
The above results can be generalized for any base loss function as follows:
Corollary 1.
(Bound on Constrained Generic Loss) For any loss function , the loss function defined below is an elementwise tight bound on with constraint i.e.
(10) 
the following relation holds:
(11) 
We have shown above that the hierarchy constrained loss function provides an elementwise tight bound on the base loss . We now extend this loss function to use a curriculum learning paradigm and show that the loss is a tighter bound to 0-1 loss compared to any other loss satisfying hierarchical constraints.
3.2 Hierarchical Curriculum Loss
As shown by Hu et al. (2016), 0-1 loss ensures that the empirical risk has a monotonic relation with the adversarial empirical risk. However, it is non-differentiable and difficult to optimize. Following the groundwork by Lyu and Tsang (2019), who propose an example-based curriculum loss, we present a class-based curriculum loss for any given loss function following the hierarchical constraint in the following theorem. The theorem also proves that the function defined is a tighter bound to 0-1 loss compared to any loss function which satisfies the hierarchical constraint and is lower bounded by . Note that a general loss function is elementwise lower bounded by 0-1 loss i.e. .
Theorem 3.
(Hierarchical Class-Based Curriculum Loss) For a general hierarchy constrained loss function , we define the loss function as follows:
(12) 
Then, i.e. the following holds
(13) 
(14) 
(15) 
Proof.
Consider
(16) 
For the lower bound we have,
(17) 
We thus have . Subtracting from both sides, we get the theorem. ∎
3.3 Algorithm
In the above theorem, we prove that the proposed hierarchical class-based curriculum loss provides a tighter bound to 0-1 loss compared to the hierarchically constrained loss function. Given the above, we now need to find the optimal class selection parameters for each class. We show that Algorithm 1 provides the optimal selection parameters:
Theorem 4.
The proof of the above is provided in the Appendix. Note that the time complexity of the above algorithm is and is thus computationally inexpensive, as the number of classes typically does not exceed the order of thousands.
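Algorithm 1 and Equation 12 are not reproduced in this text, so the following is not the authors' algorithm. It is a minimal sketch of class selection under an assumed, simplified curriculum objective in the style of example-based curriculum loss: choose the subset of classes that minimizes the larger of (a) the summed loss of the selected classes and (b) the number of unselected classes. All function names are hypothetical. For a fixed subset size the cheapest choice is the smallest losses, so a single pass over the sorted losses suffices; an exhaustive search is included only as a reference.

```python
from itertools import combinations


def select_classes(losses):
    """Pick the classes that minimize max(sum of selected losses, C - #selected).

    losses: list of non-negative per-class losses for one example.
    Returns (objective value, sorted indices of the selected classes).
    For a fixed number of selected classes the cheapest choice is the
    smallest losses, so it suffices to scan prefixes of the sorted order.
    """
    n = len(losses)
    order = sorted(range(n), key=lambda j: losses[j])
    best_value, best_k = float(n), 0  # selecting nothing costs C
    running = 0.0
    for k, j in enumerate(order, start=1):
        running += losses[j]
        value = max(running, n - k)
        if value < best_value:
            best_value, best_k = value, k
    return best_value, sorted(order[:best_k])


def select_classes_bruteforce(losses):
    """Exhaustive reference: try every subset (exponential, for checking only)."""
    n = len(losses)
    best = float(n)
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            best = min(best, max(sum(losses[j] for j in subset), n - k))
    return best


losses = [0.2, 2.5, 0.1, 0.9]
value, chosen = select_classes(losses)
print(value, chosen)
```

In this example the three low-loss classes are selected and the high-loss class (index 1) is left out. The sorted scan costs one sort per example, which is why a selection step of this kind stays cheap even with thousands of classes; whether this matches the complexity of Algorithm 1 is not something this sketch can confirm.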
4 Experiments
We first perform an ablation study on each of the components of the hierarchical class-based curriculum loss, including the hierarchically constrained loss and the class-based curriculum loss, and show how they interplay to provide the final loss function. We then compare our loss function against state-of-the-art losses to show its performance gain.
4.1 Experimental Setup
We evaluate our loss function on two real-world image data sets: Diatoms Dimitrovski et al. (2012) and IMCLEF Dimitrovski et al. (2011). The Diatoms data set contains 3,119 images of diatoms (a large and ecologically important group of unicellular or colonial organisms (algae)). Each diatom can correspond to one or many of the categories arranged in a hierarchy. Overall, there are 399 categories in this data set arranged into a hierarchy of height 4. On the other hand, the IMCLEF data set contains X-ray images of human bodies, and its classes correspond to body parts arranged in a hierarchy of height 4 containing 47 categories. For both these data sets, we use a pre-extracted feature set, extracted using image segmentation techniques including Fourier transforms and histograms of local SIFT descriptors.
For evaluation, we use a multilayer perceptron with the above extracted features as input and the categories as output. We select the hyperparameters of the neural network using evaluation on a validation set with binary cross entropy loss. Based on this, we get a structure with 800 hidden neurons and a dropout of 0.25. Note that we fix this network for all the baseline loss functions and our loss function to ensure a fair comparison of results. We compare the hierarchical class-based curriculum loss with the following state-of-the-art losses: (i) binary cross entropy loss Goodfellow et al. (2016), (ii) focal loss Lin et al. (2017), and (iii) hierarchical cross entropy loss Bertinetto et al. (2019). Further, we also compare it with SoftLabels Bertinetto et al. (2019), which modifies the ground truth labels in accordance with the hierarchy.
We use the following metrics to evaluate each of the losses for the classification task: (i) Hit@1, (ii) MRR (Mean Reciprocal Rank) Radev et al. (2002), and (iii) HierDist Deng et al. (2010). The first two metrics capture the accuracy of the ranking of the model predictions; hierarchy-capturing methods often show lower performance on them compared to non-hierarchical methods as the losses get more constrained. Hierarchical methods often show improvements on a metric which captures how close to the ground truth class the prediction is in the given hierarchy.
Our final metric, HierDist, captures this and can be defined as the minimum height of the lowest common ancestor (LCA) between the ground truth labels and the top prediction from the model. Mathematically, for a data point , it can be defined as follows:
where denotes the hierarchy of the labels. Note that, as pointed out by Deng et al. (2010), the metric is effectively on a log scale. It is measured by the height in the hierarchy of the lowest common ancestor, and moving up a level can more than double the number of descendants depending on the fan-out of the parent class (often greater than 3-4). We show that our loss function is superior to the baseline losses for this metric. In addition, our model’s performance also doesn’t deteriorate on non-hierarchical metrics.
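The three metrics can be sketched for a single example as follows. This is a minimal illustration under stated assumptions: `scores` maps each class to a model score, `truth` is the set of ground truth labels, the hierarchy has a single root and is given as hypothetical `parent` and `children` dictionaries, and the height of a node is the length of the longest path from it down to a leaf. The exact formula used for HierDist is not reproduced in this text, so the function below follows the verbal definition only.

```python
def hit_at_1(scores, truth):
    """1.0 if the highest-scoring class is a ground-truth label, else 0.0."""
    top = max(scores, key=scores.get)
    return 1.0 if top in truth else 0.0


def reciprocal_rank(scores, truth):
    """1 / rank of the best-ranked ground-truth label (ranks start at 1)."""
    ranking = sorted(scores, key=scores.get, reverse=True)
    for rank, cls in enumerate(ranking, start=1):
        if cls in truth:
            return 1.0 / rank
    return 0.0


def height(node, children):
    """Length of the longest downward path from node to a leaf."""
    kids = children.get(node, [])
    return 0 if not kids else 1 + max(height(k, children) for k in kids)


def ancestors(node, parent):
    """Path from node up to the root, starting with node itself."""
    path = [node]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def hier_dist(scores, truth, parent, children):
    """Minimum LCA height between the top prediction and any true label."""
    top = max(scores, key=scores.get)
    top_path = ancestors(top, parent)
    best = None
    for label in truth:
        label_path = set(ancestors(label, parent))
        # First node on the prediction's path that is also above the label.
        lca = next(n for n in top_path if n in label_path)
        h = height(lca, children)
        best = h if best is None else min(best, h)
    return best


# Toy hierarchy (illustrative only).
children = {"animal": ["dog", "cat"], "dog": ["bulldog", "poodle"]}
parent = {"dog": "animal", "cat": "animal", "bulldog": "dog", "poodle": "dog"}
scores = {"bulldog": 0.2, "poodle": 0.7, "cat": 0.1}
truth = {"bulldog"}
print(hit_at_1(scores, truth), reciprocal_rank(scores, truth),
      hier_dist(scores, truth, parent, children))  # 0.0 0.5 1
```

MRR is the mean of `reciprocal_rank` over all test examples, and Hit@1 and HierDist are averaged in the same way. In the toy example the model confuses two sibling breeds, so the top prediction is wrong (Hit@1 is 0) but the LCA is "dog" at height 1, a milder error than predicting "cat", whose LCA with "bulldog" is the root at height 2.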
4.2 Ablation Studies
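Table 1: Ablation results on held-out test data, with cross entropy as the base loss.
Methods       Diatoms Hit@1  Diatoms MRR  Diatoms HierDist  IMCLEF Hit@1  IMCLEF MRR  IMCLEF HierDist
CrossEntropy  74.45          85.19        1.26              90.45         93.33       0.35
HCLHier       75.12          81.46        1.23              90.45         93.33       0.33
HCLCL         74.93          81.34        1.24              90.65         93.34       0.32
HCL           75.21          81.56        1.22              90.95         93.49       0.22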
We show the effects of the hierarchical constraints and the curriculum loss using cross entropy loss as the base function in Table 1. The reported results are evaluated on held-out test data.
Consider the hierarchically constrained cross entropy loss, HCLHier. For Diatoms, we observe that the loss function improves the hierarchical metric, decreasing the hierarchical distance from 1.26 to 1.23. Further, we see that the non-hierarchical metrics also improve compared to the baseline loss. For IMCLEF, we observe that the non-hierarchical metrics stay similar to the cross entropy baseline but the loss gives an improvement on the hierarchical distance metric. This shows that our formulation of adding the hierarchical constraint doesn’t deteriorate typical metrics, while making the predictions more consistent with the label space hierarchy and closer to the ground truth label in the hierarchical space. We believe that the improvement in non-hierarchical metrics may be attributed to the modified ranking due to the hierarchical constraint making the top predictions more likely.
Looking into the class-based curriculum loss individually, we observe that, as suggested by theory, carefully selecting the classes based on their loss (implicitly giving more weight to simpler classes) improves the hierarchical evaluation metric, making the predictions more relevant. Further, as this way of learning also makes the learned model function smoother, we observe that in both the data sets the non-hierarchical metrics also improve significantly with respect to the baseline. In both the data sets, we observe similar trends of improvement in all the metrics.
Overall, combining the two aspects yields HCL, which shows the best performance. We see that for all the metrics, HCL gives significant gains both with respect to the baseline loss and the individual components. This follows the theory, in which we have shown that combining the class-based curriculum loss with hierarchical constraints gives a tighter bound to 0-1 loss with respect to the hierarchically constrained loss. Further, as this loss explicitly ensures that the loss of a higher level node is lower than that of a lower level node, and implicitly gives more weight to higher level nodes as they are selected more, the combined effect makes it more consistent with the hierarchy. This can be particularly seen with the IMCLEF data set, for which each individual component gave good improvements for the hierarchical metric but the combined loss gave a much more significant gain. We also note that Diatoms has 399 classes compared to the 47 classes of IMCLEF with a similar number of levels, making the number of categories in each layer around 8 times higher and the gains in HierDist more difficult to attain.
4.3 Results
Table 2: Comparison of HCL with baseline loss functions and SoftLabels.
Methods       Diatoms Hit@1  Diatoms MRR  Diatoms HierDist  IMCLEF Hit@1  IMCLEF MRR  IMCLEF HierDist
CrossEntropy  74.45          81.00        1.26              90.45         93.33       0.35
FocalLoss     74.36          80.47        1.27              90.85         93.60       0.26
HierCE        75.12          81.51        1.24              90.85         93.63       0.24
SoftLabels    72.36          74.95        1.38              90.45         92.07       0.33
HCL           75.21          81.56        1.22              90.95         93.49       0.22
We now compare our proposed loss (HCL) with the state-of-the-art loss functions capturing hierarchy as well as a label embedding method (SoftLabels). From Table 2, we observe that our loss significantly outperforms the base loss functions. Consistent with the results presented by Bertinetto et al. (2019), we see that previously proposed hierarchically constrained loss functions, especially SoftLabels, improve on hierarchical metrics but their performance deteriorates on non-hierarchical metrics. On the other hand, HCL’s performance improves or stays comparable to baselines on non-hierarchical metrics, while getting significant gains on the hierarchical metric over state-of-the-art loss functions. We observe that HierCE is the best performing baseline but our model outperforms it on every metric except MRR on IMCLEF. As pointed out above, the gains in Diatoms are more difficult to obtain given the number of classes in each level, but our model gives visible gains on it as well.
5 Conclusion
In this paper, we proposed a novel loss function for the multilabel multiclass classification task, ensuring the predictions are consistent with the hierarchical constraints present in the label space. For the hierarchical loss, we proposed a formulation to modify any loss function to incorporate hierarchical constraints and showed that this loss function is a tight bound on the base loss compared to any other loss satisfying hierarchical constraints. We next proposed a class-based curriculum loss, which implicitly gives more weight to simpler classes, rendering the model smoother and more consistent with the hierarchy, as shown experimentally. We combined the two components and theoretically showed that the combination provides a tighter bound to 0-1 loss, rendering it more robust and accurate. We validated our theory with experiments on two multilabel multiclass hierarchical image data sets of human X-rays and diatoms. We showed that the results are consistent with the theory and show significant gains on the hierarchical metric for the data sets. Further, we observed that our models also improve on non-hierarchical metrics, making the loss function more widely applicable. In the future, we would like to relax the hierarchical constraints and develop a loss for a general graph-based relation structure between the labels. Further, we would like to test the model on other real-world data sets which contain relations between labels. Finally, we would also like to test the model performance when we introduce noise into the hierarchy and the labels.
References
 [1] (2017) Initializing neural networks for hierarchical multilabel text classification. In BioNLP 2017.
 [2] (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48.
 [3] (2019) Making better mistakes: leveraging class hierarchies with deep networks. arXiv preprint arXiv:1912.09393.
 [4] (2017) Do convolutional neural networks learn class hierarchy? IEEE transactions on visualization and computer graphics 24 (1), pp. 152–162.
 [5] (2014) Hierarchical multilabel classification using local neural networks. Journal of Computer and System Sciences 80 (1), pp. 39–56.
 [6] (2019) Two-stage label embedding via neural factorization machine for multilabel classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3304–3311.
 [7] (2009) Ranking measures and loss functions in learning to rank. In Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta (Eds.), pp. 315–323.
 [8] (2010) What does classifying more than 10,000 image categories tell us? In European conference on computer vision, pp. 71–84.
 [9] (2011) Hierarchical annotation of medical images. Pattern Recognition 44 (10-11), pp. 2436–2449.
 [10] (2012) Hierarchical classification of diatom images using ensembles of predictive clustering trees. Ecological Informatics 7 (1), pp. 19–29.
 [11] (2013) DeViSE: a deep visual-semantic embedding model. In Advances in neural information processing systems.
 [12] (2016) Contextual LSTM (CLSTM) models for large scale NLP tasks. In KDD Workshop on Large-scale Deep Learning for Data Mining (DL-KDD).
 [13] (2016) Deep learning. MIT Press. Note: http://www.deeplearningbook.org
 [14] (2018) Graph embedding techniques, applications, and performance: a survey. Knowledge-Based Systems 151, pp. 78–94.
 [15] (2017) Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 355–364.
 [16] (2016) Does distributionally robust supervised learning give robust classifiers? arXiv preprint arXiv:1611.02041.
 [17] (2017) Cost-sensitive label embedding for multilabel classification. Machine Learning 106 (9-10), pp. 1725–1746.
 [18] (2018) Multilabel classification using hierarchical embedding. Expert Systems with Applications 91, pp. 263–269.
 [19] (2019) Efficient incremental learning for mobile object detection. In Proceedings of the ACM/IEEE Symposium on Edge Computing (SEC).
 [20] (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988.
 [21] (2018) Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence.
 [22] (2019) Curriculum loss: robust learning and generalization against label corruption. arXiv preprint arXiv:1905.10045.
 [23] (2018) AWX: an integrated approach to hierarchical-multilabel classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 322–336.
 [24] (2019) Teacher-student curriculum learning. IEEE transactions on neural networks and learning systems.
 [25] (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
 [26] (2019) Label embedding using hierarchical structure of labels for twitter classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6318–6323.
 [27] (2014) GloVe: global vectors for word representation. In EMNLP, Vol. 14, pp. 1532–43.
 [28] (2002) Evaluating web-based question answering systems. In LREC.
 [29] (2019) Taking a hint: leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2591–2600.
 [30] (2019) Super-convergence: very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Vol. 11006, pp. 1100612.
 [31] (2017) Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472.
 [32] (2012) Learning hierarchical similarity metrics. In 2012 IEEE conference on computer vision and pattern recognition, pp. 2280–2287.
 [33] (2018) Hierarchical multilabel classification networks. In International Conference on Machine Learning, pp. 5075–5084.
 [34] (2016) Learning to make better mistakes: semantics-aware visual food recognition. In Proceedings of the 24th ACM international conference on Multimedia, pp. 172–176.
 [35] (2019) Generative visual dialogue system via adaptive reasoning and weighted likelihood estimation. In Proceedings of IJCAI.
 [36] (2020) Class-incremental learning via deep model consolidation. In IEEE Winter Conference on Applications of Computer Vision, WACV, pp. 1120–1129.