
Learning with Hierarchical Complement Objective

by   Hao-Yun Chen, et al.

Label hierarchies widely exist in many vision-related problems, ranging from explicit label hierarchies in image classification to latent label hierarchies in semantic segmentation. Nevertheless, state-of-the-art methods often deploy cross-entropy loss, which implicitly assumes class labels to be exclusive and thus independent of each other. Motivated by the fact that classes from the same parental category usually share certain similarities, we design a new training paradigm called Hierarchical Complement Objective Training (HCOT) that leverages the information from a label hierarchy. HCOT maximizes the probability of the ground-truth class, and at the same time, neutralizes the probabilities of the rest of the classes in a hierarchical fashion, making the model take advantage of the label hierarchy explicitly. The proposed HCOT is evaluated on both image classification and semantic segmentation tasks. Experimental results confirm that HCOT outperforms state-of-the-art models on CIFAR-100, ImageNet-2012, and PASCAL-Context. The study further demonstrates that HCOT can be applied to tasks with latent label hierarchies, which is a common characteristic in many machine learning tasks.



1 Introduction

(a) Baseline (cross-entropy)
(b) COT
(c) HCOT
Figure 4: Sorted predicted probabilities from three different training paradigms evaluated on the CIFAR-100 dataset using PreAct ResNet-18. The red bar indicates the probability of the ground-truth class, the green bars are the probabilities of classes in the same parental category as the ground truth, and the blue bars are the probabilities of the remaining classes (see Sec. 3 for detailed notation definitions). Notice the “staircase shape” in (c), showing the significant difference between the red and green probabilities, and then between the green and blue ones, which confirms HCOT captures the label hierarchy well.

Many machine learning tasks involve making predictions on classes that have an inherent hierarchical structure. One example is image classification with hierarchical categories, where a category shares the same parental category with others. For example, the categories labeled “dog” and “cat” might share a common parental category “pet”, which forms an explicit label hierarchy. Another example is the task of semantic segmentation, where “beach” and “sea” are under the same theme “scenery”, forming a latent label hierarchy, while “people” and “pets” form another one under “portrait.” In this work, we call a parental category a coarse(-level) category, while a category under a coarse category is called a fine(-level) category.

Many successful deep learning models are built and trained with a cross-entropy loss that assumes prediction classes to be mutually independent. This assumption works well for many tasks, such as traditional image classification, where no hierarchical information is present. In the explicitly hierarchical setting, however, learning with objectives that pose such a strong assumption makes it difficult for the model to utilize the hierarchical structure in the label space. Another challenge in modeling hierarchical labels is that many tasks exhibit a latent label hierarchy. Take semantic segmentation for example: an inherent hierarchical structure has been explored by [32] as “global context”. However, the dataset itself does not contain hierarchical information.

In this paper, we develop a technique that is capable of leveraging the information in a label hierarchy by proposing a new training objective. Our proposed technique is different from previous methods [9, 19, 28, 32], which exploit the label hierarchy by changing model architectures but not the objectives. The general idea we propose is to penalize incorrect classes at different granularity levels: the classes that are “obviously wrong”—different from not only the ground truth but also the parental category of the ground truth—should receive a larger penalty than the ones that share the same parental category as the ground truth. Such a mechanism allows us to take advantage of the information in the label hierarchy during training.

To achieve this goal of training with hierarchy information, we introduce the concept of Complement Objective Training (COT) [2, 3] into label hierarchy. In COT, the probability of the correct class is maximized by a primary objective (i.e., cross-entropy), while the probabilities of incorrect classes are neutralized by a complement objective [3]. This training paradigm aims at widening the gaps between the predicted probability value of the ground truth and those of the incorrect classes. In this paper, we propose Hierarchical Complement Objective Training (HCOT) with a novel complement objective called “Hierarchical Complement Entropy” (defined in Sec. 3), by applying the idea of the complement objective on both the fine-level class and its corresponding coarse-level class.

HCOT learns the class probabilities in three ways: (a) maximizing the predicted probability of the ground truth, (b) neutralizing the predicted probabilities of incorrect classes sharing the same coarse-level category as the ground truth, and (c) further penalizing others that are on different branches (in the label hierarchy) from the ground-truth class. Figure 4 illustrates the general idea of HCOT compared to cross-entropy and COT, showing that HCOT leads to both a confident prediction for the ground-truth class and a predicted distribution that better reflects the label hierarchy (and is therefore closer to the true data distribution). In particular, the probability mass of the classes belonging to the parental category of the ground truth (in green) is significantly higher than that of the rest of the classes (in blue). In other words, the model is trained to strongly penalize the “obviously wrong” classes that are completely irrelevant to both the ground-truth class and the other classes belonging to the same parental category.

We evaluate HCOT on two important problems: image classification and semantic segmentation. Experimental results show that models trained with the Hierarchical Complement Entropy achieve significantly better performance than both cross-entropy and COT, across a wide range of state-of-the-art methods. We also show that HCOT improves model performance when predicting the coarse-level classes. Finally, we show that HCOT can deal with not only tasks with an explicit label hierarchy but also those with a latent label hierarchy. To the best of our knowledge, HCOT is the first paradigm that trains deep neural models using an objective that leverages information from a label hierarchy, and it leads to significant performance improvements.

2 Background

Learning Label Hierarchy.

Leveraging label hierarchies from data has been explored for general purposes for years. [7] tried to exploit discriminative properties from label hierarchies via regularization layers. [1] studied how label hierarchies can help deep neural networks by utilizing the confusion patterns of fine categories to follow a hierarchical structure over the classes. For fine-grained image classification, [27] augments the fine-grained data with auxiliary images labeled by coarse classes, exploiting a regularization between fine-grained and coarse-grained recognition models. However, the above-mentioned approaches are usually not compatible with the cutting-edge deep models [10, 12] and data augmentations [6, 31] proposed in recent years. State-of-the-art methods on hierarchical problems have since tended to adopt approaches that ignore hierarchical information during training.

Explicit Label Hierarchy.

Many tasks exhibit an explicit label hierarchy that is presented as part of the dataset. Explicit hierarchical structures exist among the class labels for a wide range of problems. Taking visual recognition as an example, there have been many prior arts on non-neural models focused on exploiting the hierarchical structure in categories [24]. For neural models, HD-CNN [28] is an early work using the category hierarchy to improve performance over flat N-way deep-network classifiers. The network architecture of HD-CNN contains a coarse component and several fine-grained components for learning from labels of different levels. Unlike HD-CNN, which uses one fixed model, Blockout [19] uses a regularization framework that learns both the model parameters and the sub-networks within a deep neural network to capture the information in a label hierarchy. Another prior art, CNN-RNN [9], combines a CNN-based classifier with a Recurrent Neural Network to exploit the hierarchical relationship, sequentially from the coarse categories to the fine ones. All of the above-mentioned approaches rely on modifying model architectures to capture the hierarchical structures among the class labels. This raises an intriguing question: Is it possible to design a training objective, rather than proposing a new model architecture, for a deep neural network to effectively capture the information contained in a label hierarchy?

Latent Label Hierarchy.

Another group of tasks does not come with explicit hierarchical information but carries an underlying assumption of an inherent label structure. Semantic segmentation is one such task, where the co-occurrence of class labels forms a latent label hierarchy. This hierarchy is not directly observed in the data but can be inferred from it. In semantic segmentation, the goal is to assign a semantic label to each pixel of an image. Typically, when training a deep network model for semantic segmentation, the information of individual pixels is taken in isolation. That is, the per-pixel cross-entropy loss is calculated for an image with respect to the ground-truth labels. To consider global information, EncNet [32] first utilizes the semantic context of scenes by exploiting model structures and provides a strong baseline in semantic segmentation. However, we argue that the potential of leveraging global information in the labeling space remains underexplored.

3 Hierarchical Complement Objective Training

In this section, we introduce the proposed Hierarchical Complement Objective Training (HCOT), which is a new training paradigm for leveraging information in a label hierarchy. Specifically, a novel training objective, Hierarchical Complement Entropy (HCE), is defined as the complement objective for HCOT. In the following, we first review the concept of the complement objective, and then provide the mathematical formulation of HCE.

Complement Objective.

In Complement Objective Training (COT) [3], a neural model is trained with both a primary objective and a complement objective: the primary objective (e.g., cross-entropy) is for maximizing the predicted probability of the ground-truth class, whereas the complement objective (e.g., complement entropy [3]) is designed to neutralize the predicted probabilities of incorrect classes, which intuitively makes a model more confident about the ground-truth class. Eq(1) gives the definition of the complement entropy:

$$\mathcal{C}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \mathcal{H}\big(\hat{y}^{(i)}_{\bar{g}}\big) \tag{1}$$

where $N$ is the total number of samples and $K$ is the set of labels. For the $i$-th sample, $z^{(i)}$ is the vector of logits. Let $g$ be the corresponding ground-truth class for the $i$-th sample, so $\bar{g} = K \setminus \{g\}$ represents the set of incorrect classes. We use $\mathcal{H}(\cdot)$ to denote the Shannon entropy function [22] over the probabilities $\hat{y}_{\bar{g}}$ normalized among the incorrect classes, defined below:

$$\mathcal{H}\big(\hat{y}_{\bar{g}}\big) = -\sum_{j \in \bar{g}} \frac{\hat{y}_j}{1-\hat{y}_g}\,\log\frac{\hat{y}_j}{1-\hat{y}_g} \tag{2}$$

where $\hat{y}_j$ is the predicted probability of class $j$, and the probability function is defined as the output of the softmax function over the logits:

$$\hat{y}_j = \frac{e^{z_j}}{\sum_{k \in K} e^{z_k}} \tag{3}$$

Intuitively, $\hat{y}_{\bar{g}}$ is the $(|K|-1)$-dimensional multinomial distribution normalized among the incorrect classes (that is, excluding the probability mass of the ground-truth class). Please note that this definition of the complement entropy is mathematically equivalent to the one presented in [3].

Despite the good performance achieved by maximizing the complement entropy to make complement events equally likely to occur, this approach does not consider the gap between the predicted distribution and the true data distribution. For example, if the ground-truth class is “dog”, flattening the predicted probabilities over irrelevant classes such as “cat” and “truck” equally is counter-intuitive.
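To make the definition concrete, the following is a minimal NumPy sketch of the complement entropy of Eq(1): the predicted distribution is re-normalized over the incorrect classes and its mean Shannon entropy is returned. This is an illustrative reading of the formula, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def complement_entropy(logits, labels, eps=1e-12):
    """Mean Shannon entropy of the predicted distribution re-normalized
    over the incorrect classes, i.e. excluding the ground-truth mass."""
    y = softmax(logits)                              # (N, K)
    n = np.arange(len(labels))
    y_g = y[n, labels]                               # ground-truth probabilities
    comp = y / np.maximum(1.0 - y_g[:, None], eps)   # divide by (1 - y_g)
    comp[n, labels] = 0.0                            # zero out the ground truth
    return float(-(comp * np.log(comp + eps)).sum(axis=1).mean())
```

The entropy is maximized when the incorrect-class probabilities are flat, which is exactly the behavior the paragraph above criticizes for ignoring the label hierarchy.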

Hierarchical Complement Entropy.

The proposed Hierarchical Complement Entropy (HCE) regulates the probability masses, similar to what the complement entropy does, but in a hierarchical fashion. Let a subgroup $S$ be the set containing the sibling classes that belong to the same parental class as the ground-truth class, that is, $g \in S$ and $S \subseteq K$. HCE first regulates the complement entropy between the subgroup $S$ and the ground truth, followed by the complement entropy between the label space $K$ and the subgroup $S$. The proposed HCE is defined in Eq(4) below, with $\theta$ being the model parameters:


$$\mathrm{HCE}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\Big[\mathcal{H}\big(\hat{y}^{(i)}_{S \setminus \{g\}}\big) + \mathcal{H}\big(\hat{y}^{(i)}_{K \setminus S}\big)\Big] \tag{4}$$

It is not hard to see that Eq(4) is a direct implementation of the predicted probabilities trained with the HCOT procedure in Figure 4, which imposes probability regulation based on the hierarchical structure of the labels. The first term, $\mathcal{H}(\hat{y}_{S \setminus \{g\}})$, regulates the inner hierarchy, which corresponds to the relationship between the probability masses marked as red and green. The second term, $\mathcal{H}(\hat{y}_{K \setminus S})$, regulates the outer hierarchy, which corresponds to the relationship between the green and blue class labels. The Hierarchical Complement Entropy ensures that the gaps between the levels of the hierarchy are as wide as possible, enforcing the hierarchical structure during training. In the extreme case when $S = K$, the second term in Eq(4) disappears and the Hierarchical Complement Entropy degenerates to the complement entropy.
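The two-term structure can be sketched as follows. The per-subset re-normalization used here (each entropy is taken over probabilities normalized within its own index set) is our reading of the formulation, and `coarse_of` is a hypothetical fine-to-coarse lookup:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def subset_entropy(p, idx, eps=1e-12):
    """Entropy of the probabilities re-normalized over the index set idx."""
    if not idx:
        return 0.0
    q = p[idx] / max(p[idx].sum(), eps)
    return float(-(q * np.log(q + eps)).sum())

def hce(logits, labels, coarse_of, eps=1e-12):
    """Hierarchical complement entropy sketch: an inner term over the
    siblings of the ground truth, plus an outer term over the classes
    outside its coarse group."""
    y = softmax(logits)
    K = logits.shape[1]
    total = 0.0
    for p, g in zip(y, labels):
        siblings = [j for j in range(K) if coarse_of[j] == coarse_of[g] and j != g]
        outside = [j for j in range(K) if coarse_of[j] != coarse_of[g]]
        total += subset_entropy(p, siblings, eps) + subset_entropy(p, outside, eps)
    return total / len(labels)
```

When every class shares one coarse group (i.e., $S = K$), `outside` is empty and only the sibling term survives, matching the degenerate case noted above.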


Our loss function consists of two terms: the normal cross-entropy term $\mathrm{XE}(\theta)$ and the complement objective term $\mathrm{HCE}(\theta)$:

$$\mathcal{L}(\theta) = \mathrm{XE}(\theta) - \mathrm{HCE}(\theta) \tag{5}$$

In Direct optimization, we simply add the two objectives together as in Eq(5) and directly optimize the result using SGD, without any hyper-parameter tuning or balancing (since HCE is to be maximized, it enters the minimization objective with a negative sign). An alternative approach is Alternating optimization, which optimizes the cross-entropy term and the complement objective term in an interleaved manner: HCE is maximized, followed by minimizing XE, within a single training iteration, following [3] for fair comparisons. In our paper, we choose between these two methods to achieve the best performance for our models.
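Schematically, the two optimization modes differ only in how the two gradients are applied per iteration. The gradient functions below are placeholders standing in for backpropagation; this is a sketch of the update rules, not a training loop:

```python
def direct_step(theta, grad_xe, grad_hce, lr):
    """Direct optimization: one SGD step on XE(theta) - HCE(theta)."""
    return theta - lr * (grad_xe(theta) - grad_hce(theta))

def alternating_step(theta, grad_xe, grad_hce, lr):
    """Alternating optimization: first ascend HCE, then descend XE,
    within a single training iteration."""
    theta = theta + lr * grad_hce(theta)   # maximize the complement objective
    return theta - lr * grad_xe(theta)     # then minimize cross-entropy
```

The two modes coincide wherever the HCE gradient is unchanged by the intermediate step, but in general the alternating variant evaluates the cross-entropy gradient at an already-shifted point.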

4 Image Classification

(a) Embeddings trained with cross-entropy
(b) Embeddings trained with HCOT
Figure 7: Embeddings from 20 coarse classes of CIFAR-100 test images. The embedding of each sample is from the penultimate layer and projected to two dimensions (by t-SNE) for visualization. Notice in (b) the clusters are more distinct, with cleaner and well-separated boundaries, by which we conjecture that the model generalizes better.

In this section, we evaluate HCOT on image classification tasks. Experiments are conducted with two widely-used datasets that contain label hierarchy: CIFAR-100 [14] and ImageNet-2012 [15].

We conduct extensive experiments on CIFAR-100 dataset to study several aspects of HCOT:

  • Does HCOT improve the performance over the state-of-the-art models?

  • Can HCOT work in synergy with other commonly-used regularization techniques such as Mixup and Cutout?

  • Does HCOT improve the classification accuracy of coarse classes over the state-of-the-art models?

  • How will HCOT affect the latent representation (embedding) of a model?

In addition, we perform a side-by-side comparison between HCOT and a state-of-the-art method, CNN-RNN [9], which also uses the label hierarchy to train image classifiers. The experimental results confirm that the proposed HCOT better captures the label structure and learns a more accurate model.

4.1 CIFAR-100

CIFAR-100 is a dataset consisting of 60k colored natural images of 32x32 pixels, equally divided into 100 classes. There are 50k images for training and 10k images for testing. The official CIFAR-100 guide [14] further groups the 100 classes into 20 coarse classes, where each coarse class contains five fine classes, forming the label hierarchy. Therefore, each image sample has one fine label and one coarse label. Here we follow the standard data augmentation techniques [10] to preprocess the dataset. During training, zero-padding, random cropping, and horizontal mirroring are applied to the images. For the testing images, we use the original 32x32 images.
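The training-time augmentation described above can be sketched as follows. The pad width of 4 pixels is an assumption based on the standard CIFAR recipe [10]:

```python
import numpy as np

def augment(img, pad=4, rng=None):
    """Standard CIFAR-style augmentation sketch: zero-pad by `pad` pixels,
    take a random crop at the original size, then randomly mirror."""
    rng = rng or np.random.default_rng()
    h, w, c = img.shape
    padded = np.zeros((h + 2 * pad, w + 2 * pad, c), dtype=img.dtype)
    padded[pad:pad + h, pad:pad + w] = img          # zero-padding
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]       # random crop back to h x w
    if rng.random() < 0.5:                          # horizontal mirroring
        crop = crop[:, ::-1]
    return crop
```

Test images bypass this function entirely, matching the "original images" evaluation protocol.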


Experimental Setup.

For CIFAR-100, we follow the same settings as the original ResNet paper [10]. Specifically, the models are trained using the SGD optimizer with a momentum of 0.9; the weight decay is set to 0.0001 and the learning rate starts at 0.1 and is divided by 10 at the 100th and 150th epochs. The models are trained for 200 epochs with a mini-batch size of 128. For training WideResNet, we follow the settings described in [30], and the learning rate is divided by 10 at the 60th, 120th and 180th epochs. In addition, no dropout [23] is applied to any baseline, according to the best practices in [13]. We follow alternating training [3], where models are trained by alternating between the primary objective (i.e., cross-entropy) and the complement objective (i.e., the Hierarchical Complement Entropy).
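The step learning-rate schedule above amounts to a simple milestone rule, sketched here for reference:

```python
def lr_at(epoch, base=0.1, milestones=(100, 150), gamma=0.1):
    """Step schedule from the ResNet setup above:
    divide the learning rate by 10 at each milestone epoch."""
    lr = base
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

For the WideResNet runs, the same function applies with `milestones=(60, 120, 180)`.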


Results.

Our method improves over both the baseline and COT across all of the state-of-the-art models, reducing error rates by a significant margin. These models range from the widely used ResNet to SE-ResNet [12], the winner of the ILSVRC 2017 classification competition. SE-ResNet introduces a novel architectural unit named the Squeeze-and-Excitation block (SE block) into the ResNet framework to explicitly capture the inter-dependencies between channels of convolutional layers. Results are shown in Table 1.

Model Baseline COT HCOT
ResNet-56  [10] 29.41 27.76 27.3
ResNet-110  [10] 27.93 27.24 26.46
SE-ResNet-56  [12] 28.11 27.04 26.54
SE-ResNet-110  [12] 26.49 26.09 25.49
PreAct ResNet-18  [11] 25.44 24.73 23.8
ResNeXt-29 (2x64d)  [26] 23.45 21.9 21.64
WideResNet-28-10  [30] 21.91 20.99 20.32
Table 1: Error rates (%) on CIFAR-100 using ResNet, SE-ResNet, and variants of ResNet.

Results with Mixup and Cutout.

We also show that HCOT can be applied in synergy with other commonly-used techniques to further improve model performance. We conduct experiments on ResNet-110 with “Cutout” [6] for input masking and “Mixup ” [31] for data augmentation. Table 2 shows the accuracy of models trained with HCOT consistently outperform the baseline and the models trained with COT.

Model Baseline COT HCOT
ResNet-110 + Cutout 24.61 23.93 23.85
ResNet-110 + Mixup 24.46 23.82 23.33
Table 2: Error rates (%) on CIFAR-100 using ResNet with Cutout and Mixup techniques.

Analysis on Coarse-level Labels.

To understand where the performance improvements of HCOT come from, we split the results into coarse and fine labels in Table 3. Here we see that HCOT improves the performance significantly on the coarse-level labels, where COT hardly improves. Such a performance improvement is a direct result of modeling label hierarchies, which is not taken into account in either the baseline or COT. Surprisingly, HCOT also improves fine-level labels significantly, over the already improved results of COT. This suggests that modeling of the fine-level labels can benefit from modeling label hierarchies.

Label Baseline COT HCOT
Coarse 15.08 15.05 14.02
Fine 24.21 23.33 22.64
Table 3: Error rates (%) on both coarse and fine classes on CIFAR-100 using SE-PreAct ResNet-18.

Embedding Space Visualization.

A visualization of the logits of the coarse-level labels is shown in Figure 7. Here we compare it against the visualization from the baseline SE-PreAct ResNet-18 [12] trained using cross-entropy. Compared to the baseline, HCOT forms more distinct clusters in the embedding space with clearly separable boundaries, by which we conjecture that the model generalizes better and therefore achieves better performance.

Comparison with CNN-RNN.

To demonstrate that the proposed HCOT effectively leverages the label hierarchy, we compare it with another state-of-the-art method, CNN-RNN [9], which also leverages the label hierarchy for training an image classifier. Specifically, the CNN-RNN framework takes advantage of the label hierarchy using a novel neural architecture combining a CNN with an RNN: the CNN is in charge of extracting discriminative features from images, and the RNN enables joint optimization using coarse and fine labels. In the CNN-RNN framework, WideResNet-28-10 (denoted as WRN) is selected as the base model, and an RNN is constructed on top of the WRN. For a fair comparison, we evaluate HCOT on the same WRN architecture, and the experimental results are provided in Table 4. The proposed HCOT achieves significantly better accuracy than WRN-RNN, confirming that the proposed HCE effectively captures the information from the label hierarchy. Also, notice that WRN-HCOT is more parameter-efficient: it requires fewer parameters than WRN-RNN, since WRN-RNN requires an entire RNN on top of the WRN.

Model WRN-RNN WRN-HCOT
Top-1 Error 21.57 20.32
Table 4: Error rates (%) on CIFAR-100 using “WRN-RNN” and “WRN trained with the proposed HCOT training paradigm” (denoted as WRN-HCOT).

4.2 ImageNet-2012

ImageNet-2012 [15] is a large-scale dataset for image classification with 1k fine categories. It consists of approximately 1.3 million training images and 50k validation images. The image labels of ImageNet-2012 are drawn from the “leaf classes” of WordNet [17], a lexical database for the English language that organizes words into hierarchies defined by hypernym (IS-A) relationships. We follow the prior art on object detection (YOLO9000 [20]) to construct hierarchies for the labels in ImageNet-2012. Specifically, leaf classes belonging to the same sub-tree are grouped together, and their parental synsets are extracted as the parental classes, forming a two-level hierarchy with synsets as parental classes and leaf (or fine-grained) classes below them. In fact, much of the literature [9, 21] aimed at improving ImageNet-2012 classification uses similar pre-processing steps to flatten a tree-based hierarchy into a two-level hierarchy.
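The grouping step described above reduces to bucketing leaf classes by their parental synset. In the sketch below, the leaf-to-parent mapping is a hypothetical stand-in; in practice it would be derived from WordNet hypernym relations as in YOLO9000:

```python
from collections import defaultdict

def build_two_level_hierarchy(leaf_to_parent):
    """Group leaf classes by their parental synset to form the two-level
    hierarchy described above (parental synsets over leaf classes)."""
    groups = defaultdict(list)
    for leaf, parent in leaf_to_parent.items():
        groups[parent].append(leaf)
    return dict(groups)
```

The number of keys in the resulting dictionary is the number of coarse classes (52 in the main experiment).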

Experimental Setup.

To prepare for the experiments, we apply random crops and horizontal flips during training, while images in the testing set use center crops (1-crop testing) [10]. We follow [8] for our experimental setup: a minibatch size of 256, 90 total training epochs, and an initial learning rate of 0.1 that is decayed by dividing by 10 at the 30th, 60th and 80th epochs. We use the same alternating training as for the CIFAR-100 dataset [3].


Results.

As the main result, we conduct HCOT with 52 coarse categories. The results in Table 5 show significant improvements in both top-1 and top-5 error rates compared to COT and the baseline (ResNet-50 trained with cross-entropy). We note that the top-5 error implicitly tests the model's ability on hierarchical labels.

Baseline COT HCOT
Top-1 Error 24.7 24.4 24.0
Top-5 Error 7.6 7.4 7.1
Table 5: Validation error rates (%) on ImageNet-2012 using ResNet-50.
(a) Real image
(b) Ground truth
(c) EncNet+JPU
(d) EncNet+JPU+HCOT
Figure 12: Examples of segmentation results in PASCAL-Context. Notice the main objects (e.g., bird, horses, building, and dogs) in each image are well-segmented in (d) and visually much closer to the ground truth in (b). We believe that HCOT retains the latent label hierarchies so the segmentation is clearer and without many irrelevant semantics.

Ablation study.

To explore the effect of coarse-class granularity on HCOT, we study the performance of our model over a range of coarse-class counts {1, 20, 52, 145, 1000} on ImageNet-2012. We observe that HCOT performs best when the category hierarchy carries sufficient information: the top-1 error is lowest at 52 coarse classes. When the number of coarse classes goes to the extremes (i.e., 1 or 1000), HCOT degrades to COT, as can be seen from Eq(4): with a single coarse class, the subgroup covers the entire label space and the outer term disappears; with 1000 coarse classes, each subgroup contains only the ground-truth class and the inner term disappears, as there are only 1000 classes in total.

Coarse classes 1 20 52 145 1000
Top-1 Error 24.4 24.3 24.0 24.2 24.4
Table 6: Validation error (%) for different numbers of coarse classes on ImageNet-2012 using ResNet-50.

5 Semantic Segmentation

In the task of semantic segmentation, there is latent hierarchical information contained among the labels [32]. In this task, the label hierarchy is not defined or given explicitly, but is rather inferred from the dataset. Applying HCOT to this task makes effective use of this inferred information in the label space. In particular, the proposed HCOT procedure achieves both high confidence in the ground truth and attention to the global scene information for each label, which maintains the hierarchy between each semantic class and the corresponding theme in the same image and helps to provide more accurate semantic segmentation.

(a) EncNet
Method PixAcc mIoU
XE 0.7835 49.70
COT 0.7844 49.65
HCOT 0.7862 49.86

(b) EncNet+JPU
Method PixAcc mIoU
XE 0.7880 51.05
COT 0.7884 51.07
HCOT 0.7918 51.35
Table 9: Segmentation results of models trained with cross-entropy (denoted as XE) versus COT and HCOT on PASCAL-Context dataset.


We apply HCOT to the widely-used PASCAL-Context dataset [18]. Each image in the PASCAL-Context dataset has dense semantic labels over the entire scene. The dataset contains 4,998 images for training and 5,105 for testing. We follow the prior arts [4, 16, 18] and create a set of 60 semantic labels for segmentation, representing the 59 most frequent object categories plus a “background” category.

Experimental Setup.

We first take EncNet (Context Encoding Module) [32] as the baseline. Here we follow previous works [5, 29, 33] in using the dilated network strategy on a pretrained ResNet-50. In addition, we apply Joint Pyramid Upsampling (JPU) [25] instead of dilated convolutions on top of EncNet (denoted as “EncNet+JPU”) to reproduce state-of-the-art results in semantic segmentation; JPU formulates the task of extracting high-resolution feature maps as a joint upsampling problem. The training details are the same as described in [25, 32]. We train the model for 80 epochs with SGD and set the initial learning rate to 0.001. The images are cropped and grouped with a batch size of 16. For data augmentation, we randomly left-right flip the images and scale them between 0.5 and 2. We also use the polynomial learning rate scheduling mentioned in [33]. Different from the training procedure for classification, here we adopt direct optimization, training our model by combining the complement loss and the primary loss together, which achieves better empirical performance and needs only marginal extra computation cost compared to the baselines.


We use pixel accuracy (PixAcc) and mean Intersection over Union (mIoU) as the evaluation metrics, with single-scale evaluation. Specifically, for the PASCAL-Context dataset, we follow the procedure in the standard competition benchmark [34] and calculate mIoU by ignoring the pixels labeled as “background”.
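The two metrics can be sketched as follows. Treating the background as label 0 is an assumption for illustration; the benchmark convention above only specifies that background pixels are excluded from the mIoU:

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_label=0):
    """mIoU over the non-background classes; pixels whose ground truth is
    `ignore_label` are excluded, following the benchmark convention above."""
    valid = gt != ignore_label
    ious = []
    for c in range(num_classes):
        if c == ignore_label:
            continue
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = (p | g).sum()
        if union > 0:
            ious.append((p & g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

def pixel_accuracy(pred, gt):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return float((pred == gt).mean())
```

Classes absent from both prediction and ground truth are skipped rather than counted as zero, which is one common convention; implementations differ on this point.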


Results.

We evaluate the quality of the segmentation from the models trained with pixel-wise cross-entropy (as the baseline) and trained with HCOT, by quantitatively calculating the PixAcc and mIoU scores and visually inspecting the output segments. To make sure that the improvement from HCOT comes from leveraging the label hierarchies, we also train models with COT to perform a three-way comparison. Experimental results show that HCOT achieves better performance than COT and the baseline (cross-entropy), as shown in Table 9(a). We also form “EncNet+JPU” as another baseline, and HCOT again significantly outperforms COT and cross-entropy (as shown in Table 9(b)). As segmentation does not have inherent label hierarchies, hierarchical structures among labels have to be inferred from the data. Labels that frequently occur together as a theme implicitly form a label hierarchy that is learned to improve the performance of the model.


In Figure 12, we show segmentation results for three test images from the PASCAL-Context dataset. In addition to the input images (Figure 12a), we show the ground-truth segmentation (Figure 12b) and the results from the EncNet+JPU model trained with cross-entropy (Figure 12c) and with the proposed HCOT (Figure 12d). The segments generated by the proposed HCOT are less fragmented and contain less noise.

6 Conclusion

In this paper, we propose Hierarchical Complement Objective Training (HCOT), a new training paradigm that deploys the Hierarchical Complement Entropy as its training objective to leverage information from a label hierarchy. HCOT neutralizes the probabilities of incorrect classes at different granularities, depending on whether they fall under the same parental category as the ground-truth class or belong to a different branch. HCOT has been extensively evaluated on image classification and semantic segmentation tasks, and the experimental results confirm that models trained with HCOT significantly outperform the state of the art. A natural line of future work is to extend HCOT to other computer vision tasks that involve rich latent label hierarchies but remain unexplored.