1 Introduction
Many machine learning tasks involve making predictions over classes that have an inherent hierarchical structure. One example is image classification with hierarchical categories, where a category shares a parental category with others: the labels "dog" and "cat" might share the common parental category "pet", forming an explicit label hierarchy. Another example is semantic segmentation, where "beach" and "sea" fall under the same theme "scenery", forming a latent label hierarchy, while "people" and "pets" form another one, "portrait." In this work, we call a parental category a coarse-level category, and a category under a coarse category a fine-level category.
Many successful deep learning models are built and trained with the cross-entropy loss, which assumes the prediction classes to be mutually independent. This assumption works well for many tasks, such as traditional image classification, where no hierarchical information is present. In an explicitly hierarchical setting, however, training with an objective that poses such a strong assumption makes it difficult for the model to utilize the hierarchical structure in the label space. Another challenge in modeling hierarchical labels is that many tasks exhibit only a latent label hierarchy. Take semantic segmentation for example: an inherent hierarchical structure has been explored by [32] as "global context", yet the dataset itself does not contain hierarchical information.
In this paper, we develop a technique that is capable of leveraging the information in a label hierarchy by proposing a new training objective. Our proposed technique differs from previous methods [9, 19, 28, 32], which exploit the label hierarchy by changing model architectures but not the objectives. The general idea is to penalize incorrect classes at different granularity levels: the classes that are "obviously wrong"—different from not only the ground truth but also the parental category of the ground truth—should receive a larger penalty than the ones that share the same parental category as the ground truth. Such a mechanism allows us to take advantage of the information in the label hierarchy during training.
To achieve this goal of training with hierarchy information, we introduce the concept of Complement Objective Training (COT) [2, 3] to the label-hierarchy setting. In COT, the probability of the correct class is maximized by a primary objective (i.e., cross-entropy), while the probabilities of incorrect classes are neutralized by a complement objective [3]. This training paradigm aims at widening the gaps between the predicted probability of the ground truth and those of the incorrect classes. In this paper, we propose Hierarchical Complement Objective Training (HCOT) with a novel complement objective called the "Hierarchical Complement Entropy" (defined in Sec. 3), applying the idea of the complement objective to both the fine-level class and its corresponding coarse-level class.
HCOT learns the class probabilities in three ways: (a) maximizing the predicted probability of the ground truth, (b) neutralizing the predicted probabilities of incorrect classes that share the same coarse-level category as the ground truth, and (c) further penalizing the other classes that are on different branches (in the label hierarchy) from the ground-truth class. Figure 4 illustrates the general idea of HCOT compared to cross-entropy and COT, showing that HCOT leads to both a confident prediction for the ground-truth class and a predicted distribution that better reflects the label hierarchy (and is therefore closer to the true data distribution). In particular, the probability mass of the classes belonging to the parental category of the ground truth (in green) is trained to be significantly higher than that of the rest of the classes (in blue). In other words, the model is trained to strongly penalize the "obviously wrong" classes that are irrelevant to both the ground-truth class and the other classes belonging to the same parental category.
We evaluate HCOT on two important problems: image classification and semantic segmentation. Experimental results show that models trained with the Hierarchical Complement Entropy achieve significantly better performance than those trained with either cross-entropy or COT, across a wide range of state-of-the-art methods. We also show that HCOT improves model performance when predicting coarse-level classes. Finally, we show that HCOT can deal with not only tasks with an explicit label hierarchy but also those with a latent one. To the best of our knowledge, HCOT is the first paradigm that trains deep neural models using an objective to leverage information from a label hierarchy, and it leads to significant performance improvements.
2 Background
Learning Label Hierarchy.
Leveraging label hierarchies from data has been explored for general purposes for years. [7] exploits discriminative properties of label hierarchies through regularization layers. [1] studied how label hierarchies can help deep neural networks by utilizing the confusion patterns among fine categories, which follow a hierarchical structure over the classes. For fine-grained image classification, [27] augments the fine-grained data with auxiliary images labeled by coarse classes, exploiting a regularization between fine-grained and coarse-grained recognition models. However, the above-mentioned approaches are usually not compatible with the cutting-edge deep models [10, 12] and data augmentations [6, 31] proposed in recent years. State-of-the-art methods on hierarchical problems have since tended to ignore the hierarchical information during training.
Explicit Label Hierarchy.
Many tasks exhibit an explicit label hierarchy that is presented as part of the dataset. Explicit hierarchical structures exist among the class labels for a wide range of problems. Taking visual recognition as an example, many prior arts on non-neural models focus on exploiting the hierarchical structure among categories [24]. For neural models, HD-CNN [28] is an early work using the category hierarchy to improve performance over flat N-way deep-network classifiers. The network architecture of HD-CNN contains a coarse component and several fine-grained components for learning from labels at different levels. Unlike HD-CNN, which uses one fixed model, Blockout [19] uses a regularization framework that learns both the model parameters and the sub-networks within a deep neural network to capture the information in a label hierarchy. Another prior art, CNN-RNN [9], combines a CNN-based classifier with a Recurrent Neural Network to exploit the hierarchical relationship sequentially, from the coarse categories to the fine ones. All of the above-mentioned approaches rely on modifying model architectures to capture the hierarchical structure among the class labels. This raises an intriguing question: is it possible to design a training objective, rather than a new model architecture, for a deep neural network to effectively capture the information contained in a label hierarchy?
Latent Label Hierarchy.
Another group of tasks does not come with explicit hierarchical information but carries an underlying assumption of an inherent label structure. Semantic segmentation is one such task, where the co-occurrence of class labels forms a latent label hierarchy: the hierarchy is not directly observed in the data but can be inferred from it. In semantic segmentation, the goal is to assign a semantic label to each pixel of an image. Typically, when training a deep network for semantic segmentation, individual pixels are treated in isolation; that is, the per-pixel cross-entropy loss is calculated for an image with respect to the ground-truth labels. To consider global information, EncNet [32] first utilizes the semantic context of scenes by exploiting model structures and provides a strong baseline in semantic segmentation. However, we argue that the potential of leveraging global information in the label space remains underexplored.
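For concreteness, the per-pixel cross-entropy described above can be sketched as follows (a PyTorch-style sketch; the function name and the `ignore_index` convention are our illustrative assumptions, not part of any cited work):

```python
import torch
import torch.nn.functional as F

def per_pixel_cross_entropy(logits, labels, ignore_index=-1):
    """Standard segmentation loss: each pixel's label is scored in
    isolation against the class scores at that pixel.

    logits: (N, C, H, W) per-pixel class scores.
    labels: (N, H, W) integer class label per pixel.
    """
    # F.cross_entropy broadcasts over the spatial dimensions, so the
    # per-pixel losses are computed and averaged in one call.
    return F.cross_entropy(logits, labels, ignore_index=ignore_index)
```

Note that nothing in this loss couples neighboring pixels or class labels; any global or hierarchical structure has to come from elsewhere.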
3 Hierarchical Complement Objective Training
In this section, we introduce the proposed Hierarchical Complement Objective Training (HCOT), which is a new training paradigm for leveraging information in a label hierarchy. Specifically, a novel training objective, Hierarchical Complement Entropy (HCE), is defined as the complement objective for HCOT. In the following, we first review the concept of the complement objective, and then provide the mathematical formulation of HCE.
Complement Objective.
In Complement Objective Training (COT) [3], a neural model is trained with both a primary objective and a complement objective: the primary objective (e.g., cross-entropy) maximizes the predicted probability of the ground-truth class, whereas the complement objective (e.g., the complement entropy [3]) neutralizes the predicted probabilities of the incorrect classes, which intuitively makes the model more confident about the ground-truth class. Eq. (1) gives the definition of the complement entropy:
$$\bar{\mathcal{C}}(\theta)=\frac{1}{N}\sum_{i=1}^{N}\mathcal{H}\big(\hat{y}_{i,\,\mathcal{K}\setminus\{g_i\}}\big) \qquad (1)$$

where $N$ is the total number of samples and $\mathcal{K}$ is the set of labels. For the $i$-th sample, $z_i$ is the vector of logits. Let $g_i$ be the corresponding ground-truth class for the $i$-th sample, so $\mathcal{K}\setminus\{g_i\}$ represents the set of incorrect classes. We use $\mathcal{H}(\cdot)$ to denote the Shannon entropy function [22] over a probability vector $\hat{y}$, defined below:

$$\mathcal{H}(\hat{y})=-\sum_{j}\hat{y}_{j}\log\hat{y}_{j} \qquad (2)$$

where $\hat{y}_{i,j}$ is the predicted probability of class $j$ for the $i$-th sample, defined as the output of the softmax function over the logits:

$$\hat{y}_{i,j}=\frac{e^{z_{i,j}}}{\sum_{k\in\mathcal{K}}e^{z_{i,k}}} \qquad (3)$$

Intuitively, $\hat{y}_{i,\,\mathcal{K}\setminus\{g_i\}}$ is the $(|\mathcal{K}|-1)$-dimensional multinomial distribution obtained by normalizing the predicted probabilities among the incorrect classes (that is, excluding the probability mass of the ground-truth class). Please note that this alternative definition of the complement entropy is mathematically equivalent to the one presented in [3].
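As an illustrative sketch of Eq. (1) (assuming a PyTorch setting; the function and variable names are ours, not from [3]), the complement entropy can be computed by re-normalizing the softmax output over the incorrect classes:

```python
import torch
import torch.nn.functional as F

def complement_entropy(logits, target):
    """Complement entropy: Shannon entropy of the predicted distribution
    re-normalized over the incorrect classes only, averaged over samples."""
    probs = F.softmax(logits, dim=1)                        # (N, K), Eq. (3)
    # Zero out the ground-truth class and re-normalize among the
    # incorrect classes, giving the (K-1)-dimensional distribution.
    mask = F.one_hot(target, num_classes=logits.size(1)).bool()
    probs = probs.masked_fill(mask, 0.0)
    probs = probs / probs.sum(dim=1, keepdim=True).clamp_min(1e-12)
    # Shannon entropy, Eq. (2), averaged over the batch.
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return ent.mean()
```

For uniform logits over $K$ classes, this returns $\log(K-1)$, the maximum entropy over the incorrect classes.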
Despite the good performance achieved by maximizing the complement entropy to make complement events equally likely to occur, this approach does not consider the gap between the predicted distributions and the true data distributions. For example, if the ground-truth class is "dog", flattening the predicted probabilities over irrelevant classes such as "cat" and "truck" is counter-intuitive.
Hierarchical Complement Entropy.
The proposed Hierarchical Complement Entropy (HCE) regulates the probability masses, similar to the complement entropy, but in a hierarchical fashion. Let the subgroup $S_{g_i}$ be the set containing the sibling classes that belong to the same parental class as the ground-truth class, that is, $g_i \in S_{g_i}$ and $S_{g_i} \subseteq \mathcal{K}$. HCE first regulates the complement entropy between the subgroup $S_{g_i}$ and the ground truth, followed by the complement entropy between the label space $\mathcal{K}$ and the subgroup $S_{g_i}$. The proposed HCE is defined in Eq. (4), with $\theta$ being the model parameters:
$$\mathrm{HCE}(\theta)=\frac{1}{N}\sum_{i=1}^{N}\Big[\mathcal{H}\big(\hat{y}_{i,\,S_{g_i}\setminus\{g_i\}}\big)+\mathcal{H}\big(\hat{y}_{i,\,\mathcal{K}\setminus S_{g_i}}\big)\Big] \qquad (4)$$
It is not hard to see that Eq. (4) is a direct implementation of the predicted probabilities trained with the HCOT procedure in Figure 4, which imposes probability regulation based on the hierarchical structure of the labels. The first term, $\mathcal{H}(\hat{y}_{i,\,S_{g_i}\setminus\{g_i\}})$, regulates the inner hierarchy, corresponding to the relationship between the probability masses marked red and green. The second term, $\mathcal{H}(\hat{y}_{i,\,\mathcal{K}\setminus S_{g_i}})$, regulates the outer hierarchy, corresponding to the relationship between the green and blue class labels. The Hierarchical Complement Entropy ensures that the gaps between the hierarchy levels are as wide as possible, enforcing the hierarchical structure during training. In the extreme case when $S_{g_i}=\mathcal{K}$, the second term in Eq. (4) vanishes and the Hierarchical Complement Entropy degenerates to the complement entropy.
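The two terms of the HCE can be sketched with class masks as follows (a PyTorch sketch assuming a precomputed fine-to-coarse label mapping; all names are our illustrative choices):

```python
import torch
import torch.nn.functional as F

def masked_entropy(probs, mask):
    """Entropy of `probs` re-normalized over the classes selected by `mask`."""
    p = probs.masked_fill(~mask, 0.0)
    p = p / p.sum(dim=1, keepdim=True).clamp_min(1e-12)
    return -(p * p.clamp_min(1e-12).log()).sum(dim=1)

def hierarchical_complement_entropy(logits, fine_target, fine_to_coarse):
    """HCE sketch: inner term over siblings of the ground truth,
    outer term over classes outside the ground truth's coarse group.
    `fine_to_coarse` is a (K,) tensor mapping fine labels to coarse labels."""
    probs = F.softmax(logits, dim=1)                                  # (N, K)
    coarse = fine_to_coarse[fine_target]                              # (N,)
    same_group = fine_to_coarse.unsqueeze(0) == coarse.unsqueeze(1)   # (N, K)
    is_gt = F.one_hot(fine_target, logits.size(1)).bool()
    inner = masked_entropy(probs, same_group & ~is_gt)                # S_g \ {g}
    outer = masked_entropy(probs, ~same_group)                        # K \ S_g
    return (inner + outer).mean()
```

With two coarse groups of two fine classes each and uniform logits, the inner distribution is degenerate (entropy 0) and the outer one is uniform over two classes (entropy $\log 2$).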
Optimization.
Our loss function consists of two terms: the standard cross-entropy term $\mathrm{XE}(\theta)$ and the complement objective term $\mathrm{HCE}(\theta)$:

$$\mathcal{L}(\theta)=\mathrm{XE}(\theta)-\mathrm{HCE}(\theta) \qquad (5)$$

where the sign of the HCE term reflects that the complement objective is maximized while the cross-entropy is minimized.
In direct optimization, we simply combine the two objectives into the single loss of Eq. (5) and optimize it directly using SGD, without any hyper-parameter tuning or balancing. An alternative approach is alternating optimization, which optimizes the cross-entropy term and the complement objective term in an interleaved fashion: within a single training iteration, HCE is maximized and then XE is minimized, following [3] for fair comparisons. In our experiments, we choose between these two methods to achieve the best performance for our models.
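A minimal sketch of the direct-optimization step (PyTorch-style; `hce_fn` stands for any implementation of the hierarchical complement entropy, and the single-loss form corresponds to Eq. (5)):

```python
import torch
import torch.nn.functional as F

def hcot_direct_step(model, optimizer, images, labels, fine_to_coarse, hce_fn):
    """One direct-optimization step: minimize cross-entropy while
    maximizing the hierarchical complement entropy, with no balancing
    hyper-parameter (i.e., minimize XE - HCE)."""
    optimizer.zero_grad()
    logits = model(images)
    xe = F.cross_entropy(logits, labels)
    hce = hce_fn(logits, labels, fine_to_coarse)
    loss = xe - hce  # maximizing HCE == minimizing its negative
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the alternating variant, the iteration would instead issue two separate backward/step calls, one on $-\mathrm{HCE}$ and one on $\mathrm{XE}$.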
4 Image Classification
In this section, we evaluate HCOT on image classification tasks. Experiments are conducted on two widely-used datasets that contain a label hierarchy: CIFAR-100 [14] and ImageNet-2012 [15].
We conduct extensive experiments on the CIFAR-100 dataset to study several aspects of HCOT:

- Does HCOT improve performance over state-of-the-art models?
- Can HCOT work in synergy with other commonly-used regularization techniques, such as Mixup and Cutout?
- Does HCOT improve the classification accuracy on coarse classes over state-of-the-art models?
- How does HCOT affect the latent representation (embedding) of a model?
In addition, we perform a side-by-side comparison between HCOT and one state-of-the-art method, CNN-RNN [9], which also uses the label hierarchy to train image classifiers. The experimental results confirm that the proposed HCOT better captures the label structure and learns a more accurate model.
4.1 CIFAR-100
CIFAR-100 is a dataset consisting of 60k colored natural images of 32x32 pixels, equally divided into 100 classes; there are 50k images for training and 10k images for testing. The official guide of CIFAR-100 [14] further groups the 100 classes into 20 coarse classes, where each coarse class contains five fine classes, forming the label hierarchy. Therefore, each image sample has one fine label and one coarse label. We follow the standard data augmentation techniques [10] to preprocess the dataset: during training, zero-padding, random cropping, and horizontal mirroring are applied to the images; for testing, we use the original 32x32 images.
Experimental Setup.
For CIFAR-100, we follow the same settings as the original ResNet paper [10]. Specifically, the models are trained using an SGD optimizer with momentum 0.9; the weight decay is set to 0.0001 and the learning rate starts at 0.1, then is divided by 10 at the 100th and 150th epochs. The models are trained for 200 epochs with a mini-batch size of 128. For training WideResNet, we follow the settings described in [30], and the learning rate is divided by 10 at the 60th, 120th, and 180th epochs. In addition, no dropout [23] is applied to any baseline, following the best practices in [13]. We follow alternating training [3], where models are trained by alternating between the primary objective (i.e., cross-entropy) and the complement objective (i.e., the Hierarchical Complement Entropy).
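The optimization schedule above can be sketched as follows (a PyTorch sketch; the helper name is ours):

```python
import torch

def make_cifar_optimizer(model, wide_resnet=False):
    """SGD with the schedule described above: momentum 0.9, weight decay
    1e-4, learning rate 0.1 divided by 10 at fixed epochs (100/150 for
    ResNet-style models, 60/120/180 for WideResNet)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    milestones = [60, 120, 180] if wide_resnet else [100, 150]
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=milestones, gamma=0.1)
    return optimizer, scheduler
```

The scheduler is stepped once per epoch, so the learning rate follows the stated piecewise-constant decay.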
Results.
Models trained with HCOT outperform both the baseline and COT across all of the state-of-the-art architectures, improving error rates by a significant margin. These architectures range from the widely used ResNet to SE-ResNet [12], the winner of the ILSVRC 2017 classification competition; SE-ResNet introduces novel architectural units named Squeeze-and-Excitation blocks (SE blocks) into the ResNet framework to explicitly capture the inter-dependencies between channels of convolutional layers. Results are shown in Table 1.
Model | Baseline | COT | HCOT
ResNet-56 [10] | 29.41 | 27.76 | 27.30
ResNet-110 [10] | 27.93 | 27.24 | 26.46
SE-ResNet-56 [12] | 28.11 | 27.04 | 26.54
SE-ResNet-110 [12] | 26.49 | 26.09 | 25.49
PreAct ResNet-18 [11] | 25.44 | 24.73 | 23.80
ResNeXt-29 (2x64d) [26] | 23.45 | 21.90 | 21.64
WideResNet-28-10 [30] | 21.91 | 20.99 | 20.32
Results with Mixup and Cutout.
We also show that HCOT can be applied in synergy with other commonly-used techniques to further improve model performance. We conduct experiments on ResNet-110 with Cutout [6] for input masking and Mixup [31] for data augmentation. Table 2 shows that the error rates of models trained with HCOT are consistently lower than those of the baseline and of models trained with COT.
Model | Baseline | COT | HCOT
ResNet-110 + Cutout | 24.61 | 23.93 | 23.85
ResNet-110 + Mixup | 24.46 | 23.82 | 23.33
Analysis on Coarse-level Labels.
To understand where the performance improvements of HCOT come from, we split the results into coarse and fine labels in Table 3. Here we see that HCOT improves performance significantly on the coarse-level labels, where COT hardly improves at all. This improvement is a direct result of modeling the label hierarchy, which neither the baseline nor COT takes into account. Surprisingly, HCOT also improves the fine-level results significantly, over the already-improved results of COT. This suggests that modeling the fine-level labels also benefits from modeling the label hierarchy.
Label | Baseline | COT | HCOT
Coarse | 15.08 | 15.05 | 14.02
Fine | 24.21 | 23.33 | 22.64
Embedding Space Visualization.
A visualization of the logits of the coarse-level labels is shown in Figure 7, compared against the visualization from the baseline SE-PreAct ResNet-18 [12] trained with cross-entropy. Compared to the baseline, HCOT forms more distinct clusters with clearly separable boundaries in the embedding space, from which we conjecture that the model generalizes better and therefore achieves better performance.
Comparison with CNN-RNN.
To demonstrate that the proposed HCOT effectively leverages the label hierarchy, we compare it with another state-of-the-art method, CNN-RNN [9], which also leverages the label hierarchy for training an image classifier. The CNN-RNN framework takes advantage of the label hierarchy through a novel neural architecture combining a CNN with an RNN: the CNN extracts discriminative features from images, and the RNN enables joint optimization using the coarse and fine labels. In the CNN-RNN framework, WideResNet-28-10 (denoted as WRN) is selected as the base model, and an RNN is constructed on top of the WRN. For a fair comparison, we evaluate HCOT on the same WRN architecture; the experimental results are provided in Table 4. The proposed HCOT achieves significantly better accuracy than WRN-RNN, confirming that the proposed HCE effectively captures the information in the label hierarchy. Also notice that WRN-HCOT is more parameter-efficient: it requires fewer parameters than WRN-RNN, since WRN-RNN adds a whole RNN on top of the WRN.
Method | WRN-RNN | WRN-HCOT
Top-1 Error | 21.57 | 20.32
4.2 ImageNet-2012
ImageNet-2012 [15] is a large-scale dataset for image classification with 1k fine categories, consisting of approximately 1.3 million training images and 50k validation images. The image labels of ImageNet-2012 are the "leaf classes" of WordNet [17], a lexical database for the English language that organizes words into hierarchies defined by hypernym (IS-A) relationships. We follow the prior art in object detection (YOLO9000 [20]) to construct hierarchies for the labels of ImageNet-2012. Specifically, leaf classes that belong to the same subtree are grouped together, and their parental synsets are extracted as the parental classes, forming a two-level hierarchy with synsets as parental classes above the leaf (or fine-grained) classes. Much prior literature [9, 21] aiming to improve ImageNet-2012 classification uses similar preprocessing steps to flatten the tree-based hierarchy into a two-level hierarchy.
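The grouping step can be sketched as follows (a toy sketch on a hypothetical leaf-to-parent mapping; it does not query WordNet itself):

```python
def build_two_level_hierarchy(leaf_to_parent):
    """Flatten a tree of labels into a two-level hierarchy by grouping
    leaf classes that share the same parent synset into one coarse class.

    leaf_to_parent: dict mapping each leaf label to its parent synset name.
    Returns (fine_to_coarse, coarse_names), where fine_to_coarse maps each
    leaf label to an integer coarse-class index.
    """
    coarse_names = sorted(set(leaf_to_parent.values()))
    coarse_index = {name: i for i, name in enumerate(coarse_names)}
    fine_to_coarse = {leaf: coarse_index[parent]
                      for leaf, parent in leaf_to_parent.items()}
    return fine_to_coarse, coarse_names
```

The resulting `fine_to_coarse` mapping is exactly what the hierarchical objective needs to decide which incorrect classes are siblings of the ground truth.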
Experimental Setup.
To prepare the data, we apply random crops and horizontal flips during training, while images in the testing set use center crops (1-crop testing) [10]. We follow [8] for our experimental setup: a mini-batch size of 256, 90 total training epochs, and an initial learning rate of 0.1 that is decayed by a factor of 10 at the 30th, 60th, and 80th epochs. We use the same alternating training as on the CIFAR-100 dataset [3].
Results.
As the main result, we conduct HCOT with 52 coarse categories. Table 5 shows significant improvements in both top-1 and top-5 error rates compared to COT and the baseline (ResNet-50 trained with cross-entropy). We note that the top-5 error implicitly tests a model's ability on hierarchical labels.
Metric | Baseline | COT | HCOT
Top-1 Error | 24.7 | 24.4 | 24.0
Top-5 Error | 7.6 | 7.4 | 7.1
Ablation study.
To explore the effect of HCOT over different granularities of the coarse classes, we study the performance of our model over a range of numbers of coarse classes, {1, 20, 52, 145, 1000}, on ImageNet-2012. We observe that HCOT performs best when the category hierarchy carries sufficient information: the top-1 error is lowest at 52 coarse classes. When the number of coarse classes goes to the extremes (i.e., 1 or 1000), HCOT degrades to COT. This can be seen from Eq. (4): with a single coarse class, the subgroup covers the whole label space, so the outer term vanishes; with 1000 coarse classes, each coarse class contains a single fine class, so the inner term vanishes. In both cases the Hierarchical Complement Entropy reduces to the complement entropy.
Coarse classes | 1 | 20 | 52 | 145 | 1000
Top-1 Error | 24.4 | 24.3 | 24.0 | 24.2 | 24.4
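The two degenerate cases above can be made concrete by inspecting the class masks behind the two HCE terms (a PyTorch sketch; the helper name is ours):

```python
import torch

def hce_masks(fine_to_coarse, target, num_classes):
    """Return the (inner, outer) class masks used by the two HCE terms
    for one sample, to illustrate the degenerate cases."""
    coarse = fine_to_coarse[target]
    same_group = fine_to_coarse == coarse
    is_gt = torch.arange(num_classes) == target
    return same_group & ~is_gt, ~same_group

# One coarse class: every class is in the ground truth's group, so the
# outer mask is empty and the inner mask covers all incorrect classes
# (HCE reduces to the complement entropy). One coarse class per fine
# class: the inner mask is empty and the outer mask covers all
# incorrect classes (again the complement entropy).
```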
5 Semantic Segmentation
In the task of semantic segmentation, there is latent hierarchical information contained among the labels [32]. Here, the label hierarchy is not defined or given explicitly but is inferred from the dataset. Applying HCOT to this task makes effective use of this inferred information in the label space. In particular, the proposed HCOT procedure can achieve both high confidence on the ground truth and attention to the global scene information for each label, which maintains the hierarchy between each semantic class and the corresponding theme within an image sample and helps provide more accurate semantic segmentation.
Dataset.
We apply HCOT on the widely-used PASCAL-Context dataset [18]. Each image in the PASCAL-Context dataset has dense semantic labels over the entire scene. The dataset contains 4,998 images for training and 5,105 for testing. Following prior arts [4, 16, 18], we use a set of 60 semantic labels for segmentation: the 59 most frequent object categories plus a "background" category.
Experimental Setup.
We first take EncNet (Context Encoding Module) [32] as the baseline. Following previous works [5, 29, 33], we use the dilated-network strategy on a pretrained ResNet-50. In addition, we apply Joint Pyramid Upsampling (JPU) [25] instead of the dilated convolutions on top of EncNet (denoted as "EncNet+JPU") to reproduce the state-of-the-art results in semantic segmentation; JPU formulates the extraction of high-resolution feature maps as a joint upsampling problem. The training details are the same as described in [25, 32]: we train the model for 80 epochs with SGD and an initial learning rate of 0.001, and the images are cropped and grouped into mini-batches of size 16. For data augmentation, we randomly flip the images left-right and scale them by a factor between 0.5 and 2. We also use the polynomial learning-rate schedule mentioned in [33]. Unlike the training procedure for classification, here we adopt direct optimization, training the model by combining the complement loss and the primary loss, which achieves better empirical performance and needs only marginal extra computation compared to the baselines.
Evaluation.
We use pixel accuracy (PixAcc) and mean Intersection over Union (mIoU) as the evaluation metrics, with single-scale evaluation. Specifically, for the PASCAL-Context dataset, we follow the procedure of the standard competition benchmark [34] and calculate mIoU by ignoring the pixels labeled as "background".
Results.
We evaluate the quality of the segmentation produced by models trained with the per-pixel cross-entropy (as the baseline) and with HCOT, both by quantitatively computing the PixAcc and mIoU scores and by visually inspecting the output segments. To make sure the improvement from HCOT comes from leveraging the label hierarchy, we also train models with COT to perform a three-way comparison. Experimental results show that HCOT achieves better performance than both COT and the baseline (cross-entropy), as shown in Table 6. We also use "EncNet+JPU" as another baseline, and HCOT again significantly outperforms COT and cross-entropy (as shown in Table 7). Since segmentation datasets carry no inherent label hierarchy, hierarchical structure among the labels has to be inferred from the data: classes that frequently co-occur as a theme implicitly form a label hierarchy, which is learned to improve the performance of the model.
Visualizations.
In Figure 12, we show segmentation results for three test images from the PASCAL-Context dataset. In addition to the input images (Figure 12(a)), we show the ground-truth segmentation (Figure 12(b)) and the results from the EncNet+JPU model trained with cross-entropy (Figure 12(c)) and with the proposed HCOT (Figure 12(d)). The segments generated by the proposed HCOT are less fragmented and contain less noise.
6 Conclusion
In this paper, we propose Hierarchical Complement Objective Training (HCOT), a new training paradigm that deploys the Hierarchical Complement Entropy as the training objective to leverage information from a label hierarchy, answering the question raised in Section 2. HCOT neutralizes the probabilities of incorrect classes at different granularities, depending on whether a class falls under the same parental category as the ground-truth class or on a different branch. HCOT has been extensively evaluated on image classification and semantic segmentation tasks, and the experimental results confirm that models trained with HCOT significantly outperform the state of the art. A natural line of future work is to extend HCOT to other computer-vision tasks that involve rich but still unexplored latent label hierarchies.
References

[1] Bilal Alsallakh, Amin Jourabloo, Mao Ye, Xiaoming Liu, and Liu Ren. Do convolutional neural networks learn class hierarchy? IEEE Transactions on Visualization and Computer Graphics, 24, 2017.
[2] Hao-Yun Chen, Jhao-Hong Liang, Shih-Chieh Chang, Jia-Yu Pan, Yu-Ting Chen, Wei Wei, and Da-Cheng Juan. Improving adversarial robustness via guided complement entropy. In ICCV'19, 2019.
[3] Hao-Yun Chen, Pei-Hsin Wang, Chun-Hao Liu, Shih-Chieh Chang, Jia-Yu Pan, Yu-Ting Chen, Wei Wei, and Da-Cheng Juan. Complement objective training. In ICLR'19, 2019.
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
[5] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[6] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with Cutout. arXiv preprint arXiv:1708.04552, 2017.
[7] Wonjoon Goo, Juyong Kim, Gunhee Kim, and Sung Ju Hwang. Taxonomy-regularized semantic deep convolutional neural networks. In ECCV'16, 2016.
[8] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[9] Yanming Guo, Yu Liu, Erwin M. Bakker, Yuanhao Guo, and Michael S. Lew. CNN-RNN: A large-scale hierarchical image classification framework. Multimedia Tools and Applications, 2018.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR'16, 2016.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV'16, 2016.
[12] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR'18, 2018.
[13] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML'15, 2015.
[14] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS'12, 2012.
[16] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian D. Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. arXiv preprint arXiv:1611.06612, 2016.
[17] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 1995.
[18] R. Mottaghi, X. Chen, X. Liu, N. Cho, S. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR'14, 2014.
[19] Calvin Murdock, Zhen Li, Howard Zhou, and Tom Duerig. Blockout: Dynamic model selection for hierarchical deep networks. In CVPR'16, 2016.
[20] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In CVPR'17, 2017.
[21] M. Ristin, J. Gall, M. Guillaumin, and L. Van Gool. From categories to subcategories: Large-scale image classification with partial class label refinement. In CVPR'15, 2015.
[22] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 1948.
[23] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
[24] Anne-Marie Tousch, Stéphane Herbin, and Jean-Yves Audibert. Semantic hierarchies for image annotation: A survey. Pattern Recognition, 2012.
[25] Huikai Wu, Junge Zhang, Kaiqi Huang, Kongming Liang, and Yizhou Yu. FastFCN: Rethinking dilated convolution in the backbone for semantic segmentation. arXiv preprint arXiv:1903.11816, 2019.
[26] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR'17, 2017.
[27] S. Xie, T. Yang, Xiaoyu Wang, and Yuanqing Lin. Hyper-class augmented and regularized deep learning for fine-grained image classification. In CVPR'15, 2015.
[28] Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, and Yizhou Yu. HD-CNN: Hierarchical deep convolutional neural networks for large scale visual recognition. In ICCV'15, 2015.
[29] Fisher Yu, Vladlen Koltun, and Thomas A. Funkhouser. Dilated residual networks. In CVPR'17, 2017.
[30] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC'16, 2016.
[31] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR'18, 2018.
[32] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR'18, 2018.
[33] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR'17, 2017.
[34] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR'17, 2017.