Label Relation Graphs Enhanced Hierarchical Residual Network for Hierarchical Multi-Granularity Classification

01/10/2022
by   Jingzhou Chen, et al.
Zhejiang University

Hierarchical multi-granularity classification (HMC) assigns hierarchical multi-granularity labels to each object and focuses on encoding the label hierarchy, e.g., ["Albatross", "Laysan Albatross"] from coarse-to-fine levels. However, the definition of what is fine-grained is subjective, and the image quality may affect the identification. Thus, samples could be observed at any level of the hierarchy, e.g., ["Albatross"] or ["Albatross", "Laysan Albatross"], and examples discerned at coarse categories are often neglected in the conventional setting of HMC. In this paper, we study the HMC problem in which objects are labeled at any level of the hierarchy. The essential designs of the proposed method are derived from two motivations: (1) learning with objects labeled at various levels should transfer hierarchical knowledge between levels; (2) lower-level classes should inherit attributes related to upper-level superclasses. The proposed combinatorial loss maximizes the marginal probability of the observed ground truth label by aggregating information from related labels defined in the tree hierarchy. If the observed label is at the leaf level, the combinatorial loss further imposes the multi-class cross-entropy loss to increase the weight of fine-grained classification loss. Considering the hierarchical feature interaction, we propose a hierarchical residual network (HRN), in which granularity-specific features from parent levels acting as residual connections are added to features of child levels. Experiments on three commonly used datasets demonstrate the effectiveness of our approach compared to state-of-the-art HMC approaches and fine-grained visual classification (FGVC) methods exploiting the label hierarchy.

1 Introduction

Traditional single-granularity classification usually assigns a single label to a given object from a set of mutually exclusive class labels. For instance, FGVC aims at distinguishing objects from different subordinate-level categories within a given object category, e.g., subcategories of birds [32], cars [16], and aircraft [20]. However, the definition of what is fine-grained is subjective, and the image quality may affect the identification, as illustrated in Fig. 1. A bird can be discerned as an Albatross or a Laysan Albatross depending on the annotator's domain knowledge. Moreover, a bird expert may recognize a bird only as an Albatross rather than a Black-footed Albatross because key parts are occluded. Airborne or satellite image resolutions often vary widely, causing objects to be recognized at different levels. These challenges increase the difficulty of constructing a dataset for single-granularity classification, while images that can only be annotated with coarse categories are overlooked.

(a) Differences in domain knowledge and interference from image occlusion.
(b) Large variations in image resolution.
Figure 1: Different objects can be discerned at various levels of the label hierarchy due to differences in domain knowledge or image quality, such as occlusion or resolution.

Compared to single-granularity classification, a preferable solution is to employ hierarchical multi-granularity labels to describe an object, which provides more flexible options for annotators with different knowledge backgrounds [4]. HMC [15] aims to exploit hierarchical multi-granularity labels and embeds the label hierarchy in the loss function or network architecture. However, conventional HMC usually evaluates each sample with complete hierarchical labels from the coarsest to the finest granularity. A more robust HMC model should effectively utilize examples observed at various levels in the hierarchy, e.g., making use of bird images annotated as [“Albatross”] as well as [“Albatross”, “Laysan Albatross”].

In this paper, we study the HMC problem in which samples are labeled at any level of the hierarchy. We factorize this problem into two aspects: (1) how to effectively use instances labeled at different levels; (2) how to perform hierarchical feature interaction in the network architecture. For the first aspect, we adopt a tree hierarchy that defines two kinds of semantic relationships between labels: parent-child correlations between levels and mutual exclusion within the same level. Inspired by the work of [7], if an instance is discerned at a label in the hierarchy, we maximize its marginal probability in the probability space constrained by the tree hierarchy. Such marginalization enjoys two benefits: learning with a coarse-level label influences the decisions of fine-grained subclasses, while learning with a fine-level label aids the prediction of coarse-grained superclasses. Moreover, if the ground truth label is observed at the leaf level, we further impose the multi-class cross-entropy loss to enhance the discriminative power among fine-grained categories.

Another critical issue is to design appropriate hierarchical feature interaction that reflects the label hierarchy. A distinct characteristic of hierarchical categories is that, from coarse to fine levels, fine-level classes not only have unique attributes but also inherit attributes related to coarse-level superclasses. Based on this property, we propose a hierarchical residual network (HRN) illustrated in Fig. 2. We first set up granularity-specific layers to disentangle hierarchical features from the trunk network. Then, these hierarchical features interact via residual connections [11, 34, 14, 13, 12, 19, 30], i.e., features from parent levels acting as skip connections are added to features of child levels. Experiments on three commonly used FGVC datasets demonstrate the effectiveness of our approach compared to state-of-the-art HMC approaches and FGVC methods exploiting hierarchical knowledge under two evaluation metrics [31].

Figure 2: The network architecture consists of a trunk network (ResNet-50), the hierarchical feature interaction module, and two parallel output channels, $C_{prob}$ and $C_{ce}$, which produce the probabilistic classification loss ($\mathcal{L}_{prob}$) and the cross-entropy loss ($\mathcal{L}_{ce}$), respectively. We illustrate the network architecture on the CUB-200-2011 dataset, which contains three hierarchical levels. The granularity-specific block of each hierarchical level processes feature maps generated from the trunk network; these hierarchical features then interact via residual connections, i.e., features from parent levels acting as skip connections are added to features of child levels. $C_{prob}$ organizes the sigmoid outputs of the three hierarchical levels using the tree hierarchy, and $C_{ce}$ generates softmax outputs corresponding to the fine-grained leaf categories.

2 Related Work

2.1 Hierarchical Multi-Granularity Classification

HMC problems naturally arise in many domains, such as text categorization [17, 25, 22] and functional genomics [1, 31, 26]. In text categorization, an increasing number of works [21, 29, 23, 5] leveraged the label hierarchy to improve accuracy. In image classification, HMC systems have been used to annotate medical images [8] and classify diatom images [9]. Based on deep neural networks (DNNs), existing studies usually follow two paths: mapping the label hierarchy to network architectures [3, 2, 24, 33] or to loss functions that impose hierarchical constraints [7, 10].

HMC with local multi-layer perceptrons (HMC-LMLP) [3] trains a chain of multi-layer perceptron (MLP) networks, one per hierarchical level. The input of each MLP augments the feature vector of the instance with the output of the previously trained MLP, and this supervised, incremental, greedy procedure continues until the last level of the hierarchy is reached. The HMC network (HMCN) [33] comprises multiple local outputs, with one local output layer per hierarchical level of the class hierarchy, plus a global output layer that captures the cumulative relationships forwarded across the entire network. All local outputs are then concatenated and pooled with the global output to generate a final consensual prediction. HMC-LMLP and HMCN embed the label hierarchy in their network architectures. Their loss functions sum binary cross-entropy losses over each hierarchical level, which assumes each label is independent of the others, so the implicit hierarchical relations between semantic labels are ignored.

Another line of HMC works encodes the label hierarchy in the loss function by imposing hierarchical constraints. The coherent HMC neural network (C-HMCNN) [10] revised the binary cross-entropy loss to satisfy the parent-child constraint. The revision ensures that no hierarchy violation happens, i.e., for any threshold, when C-HMCNN predicts that a sample belongs to a class, the sample is also predicted to belong to its parent classes. Moreover, C-HMCNN can teach the network how to better make predictions on higher-level classes using the prediction results on the lower-level ones. While C-HMCNN only restricts the parent-child correlation, other kinds of semantic relations between hierarchical labels can be constructed using graphs. Deng et al. [7] formalized semantic connections between any two labels into a directed acyclic graph (DAG). They built a modified junction tree algorithm that performs multiple loops of message passing on the junction tree to compute the probabilistic classification loss defined on the DAG.

2.2 Fine-Grained Visual Classification

Since FGVC inherently forms a hierarchy with different levels of concept abstraction, many approaches [35, 36, 28, 6, 4] exploit the hierarchical label structure of FGVC. Zhang et al. [35] generalized the triplet loss by describing inequalities over the distances between images belonging to the same fine-grained class, to different fine-grained classes but the same coarse class, and to different coarse classes. Shi et al. [28] proposed a generalized large-margin loss that not only reduces between-class similarity and within-class variance of the learned features but also makes subclasses belonging to the same coarse class more similar in the feature space than those belonging to different coarse classes. Chen et al. [6] developed a hierarchical semantic embedding framework that incorporates the predicted score vector of the higher level as prior knowledge to learn finer-grained feature representations at each hierarchical level; during training, the predicted score vector of the higher level is also employed to regularize sub-category prediction by using it as soft targets. Chang et al. [4] leveraged level-specific classification heads to disentangle coarse-level features from fine-grained ones and allowed fine-grained features to participate in coarser-grained label predictions while constraining the gradient flow to only update the parameters within each classification head. Their method achieves state-of-the-art results on the traditional single-label FGVC problem. These approaches refine the feature representations related to hierarchical levels in the feature space. They develop their loss functions based on the multi-class cross-entropy loss, which implies mutual exclusion among classes at the same hierarchical level. However, they neglect other label relations, such as the parent-child correlation, that would allow hierarchical knowledge to be transferred between levels using samples observed at different levels.

3 Proposed Methods

3.1 Network Architecture

Our network architecture includes a trunk network, a hierarchical feature interaction module, and two parallel output channels, see Fig. 2. The trunk network is used to extract features from the input images, and any common backbone is applicable; here, we adopt ResNet-50 as the trunk network since it is widely used for feature extraction. The hierarchical feature interaction module contains granularity-specific blocks and residual connections. The blocks share the same structure, comprising two convolutional layers and two fully connected (FC) layers, and each block is designed to extract the specialized features of one hierarchical level. The residual connections first linearly combine features of fine-level subclasses with features of coarse-level superclasses. Accordingly, subclasses not only have unique attributes but also inherit attributes from their superclasses. Then, a non-linear transformation (ReLU) is applied to the combined features.
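
To make this interaction concrete, the sketch below shows one way the hierarchical feature interaction module could be implemented in PyTorch. It is a minimal sketch under our own assumptions: the channel widths, the per-level feature dimension, the pooling, and all class and variable names are illustrative, and whether the parent feature added to a child is taken before or after its own ReLU is our choice; the paper only specifies two convolutional and two FC layers per granularity-specific block and a parent-to-child residual addition followed by ReLU.

```python
import torch
import torch.nn as nn


class GranularitySpecificBlock(nn.Module):
    """One block per hierarchical level: two convolutional layers followed by two FC layers."""

    def __init__(self, in_channels=2048, mid_channels=512, feat_dim=600):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fcs = nn.Sequential(
            nn.Linear(mid_channels, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, trunk_feat):             # trunk_feat: (B, 2048, H, W) from the ResNet-50 trunk
        x = self.convs(trunk_feat).flatten(1)  # (B, mid_channels)
        return self.fcs(x)                     # (B, feat_dim): level-specific feature


class HierarchicalFeatureInteraction(nn.Module):
    """Parent-level features are added to child-level features (skip connection), then ReLU."""

    def __init__(self, num_levels=3, **block_kwargs):
        super().__init__()
        self.blocks = nn.ModuleList(
            [GranularitySpecificBlock(**block_kwargs) for _ in range(num_levels)]
        )

    def forward(self, trunk_feat):
        raw = [blk(trunk_feat) for blk in self.blocks]   # ordered from coarse to fine
        fused, parent = [], None
        for level_feat in raw:
            combined = level_feat if parent is None else level_feat + parent  # linear combination
            combined = torch.relu(combined)                                    # non-linear transformation
            fused.append(combined)
            parent = combined
        return fused                                      # one feature vector per hierarchical level
```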

We set up two output channels in our model. The first output channel is used to compute the probabilistic classification loss based on the tree hierarchy, in which each sigmoid node corresponds to a distinct label in the hierarchy. We perform the non-linear projection with sigmoid instead of softmax because sigmoid reflects independent relations, whereas softmax implies mutual exclusion. The sigmoid nodes from each hierarchical level are then organized with the tree hierarchy to comply with the hierarchical constraints. The second output channel computes the multi-class cross-entropy loss imposed on the leaf level so that the mutually exclusive fine-grained classes gain more attention during training. For simplicity, we denote the first and the second output channels as $C_{prob}$ and $C_{ce}$, respectively.
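
The two output channels can then be sketched as follows. This is a minimal illustration; the feature dimension, the per-level label counts (here taken from the CUB-200-2011 hierarchy of 13 orders, 38 families, and 200 species), and all names are our assumptions.

```python
import torch
import torch.nn as nn


class OutputChannels(nn.Module):
    """C_prob: one sigmoid node per label in the hierarchy; C_ce: softmax over the leaf classes."""

    def __init__(self, feat_dim=600, labels_per_level=(13, 38, 200)):
        super().__init__()
        self.sigmoid_heads = nn.ModuleList([nn.Linear(feat_dim, n) for n in labels_per_level])
        self.leaf_head = nn.Linear(feat_dim, labels_per_level[-1])

    def forward(self, level_feats):
        # C_prob: sigmoid scores of every label, concatenated level by level (coarse to fine).
        sig = torch.cat(
            [torch.sigmoid(head(feat)) for head, feat in zip(self.sigmoid_heads, level_feats)], dim=1
        )
        # C_ce: logits over the fine-grained leaf classes (softmax is applied inside the CE loss).
        leaf_logits = self.leaf_head(level_feats[-1])
        return sig, leaf_logits
```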

3.2 Loss Function

The proposed combinatorial loss integrates two forms of losses: the probabilistic classification loss and the multi-class cross-entropy loss. We first formalize the tree hierarchy to encode the semantic relations between hierarchical labels. The probabilistic classification loss defined on the tree hierarchy aims to transfer hierarchical knowledge during training. We empirically find that if the training samples labeled at the leaf level are few, the probabilistic classification loss fails to separate the fine-grained leaf classes well. One simple but feasible remedy is to increase the weight of the fine-grained classification loss. Therefore, we further impose the multi-class cross-entropy loss on the leaf categories, which obeys the mutually exclusive constraint among fine-grained classes defined in the tree hierarchy.

3.2.1 The Formalism of Tree Hierarchy

The tree hierarchy $T$ consists of a set of nodes $V$, directed edges $E_d$, and undirected edges $E_u$. Each node $v_i \in V$ corresponds to a distinct class label $y_i$, so the number of nodes, denoted $N$, equals the number of all labels in the hierarchy. A directed edge $(v_i, v_j) \in E_d$ is a subsumption edge, indicating that class $y_i$ subsumes class $y_j$, e.g., Albatross is a parent or superclass of Black-footed Albatross. An undirected edge $(v_i, v_j) \in E_u$ is an exclusion edge, denoting that classes $y_i$ and $y_j$ are mutually exclusive, e.g., a bird cannot be a Black-footed Albatross and a Laysan Albatross simultaneously. Any two nodes share either a subsumption edge or an exclusion edge.

Each class label takes binary values, i.e., $y_i \in \{0, 1\}$, representing whether an object belongs to this class or not. Each edge then defines a constraint on the binary values that the labels of its two incident nodes can take. The assignment $(y_i, y_j) = (0, 1)$ for a subsumption edge $(v_i, v_j)$ (e.g., a Black-footed Albatross but not an Albatross) is illegal, while $(y_i, y_j) = (1, 1)$ for an exclusion edge $(v_i, v_j)$ (e.g., both a Black-footed Albatross and a Laysan Albatross) is also an illegal assignment. Defined by these local constraints of individual edges, a legal global assignment of all labels in the hierarchy is a binary label vector $\mathbf{y} = (y_1, \ldots, y_N)$ for an object. The set of all legal global assignments forms the state space $\mathcal{S}_T$ of tree $T$. We can write $\mathcal{S}_T$ as a matrix $S \in \{0, 1\}^{(N+1) \times N}$, where each row represents a legal binary label vector $\mathbf{y}$. We traverse all legal assignments by assigning each label a value of 1 together with all of its ancestors, along with an assignment that is all zeros.
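
As a sanity check of this construction, the snippet below enumerates the legal state space of a toy tree hierarchy (the parent-index encoding and the function name are ours): each non-zero row switches on one node together with all of its ancestors, and the all-zero row is appended last.

```python
import numpy as np


def legal_state_matrix(parent):
    """parent[i] is the index of node i's parent, or -1 for a root.
    Returns an (N+1, N) binary matrix whose rows are the legal assignments of the tree hierarchy."""
    n = len(parent)
    rows = []
    for k in range(n):                    # assignment generated by observing node k
        row = np.zeros(n, dtype=int)
        node = k
        while node != -1:                 # switch on k and all of its ancestors
            row[node] = 1
            node = parent[node]
        rows.append(row)
    rows.append(np.zeros(n, dtype=int))   # the all-zero assignment
    return np.stack(rows)


# Toy hierarchy: 0 = Albatross; 1 = Black-footed Albatross and 2 = Laysan Albatross (children of 0).
S = legal_state_matrix([-1, 0, 0])
print(S)
# [[1 0 0]    observe "Albatross" only
#  [1 1 0]    observe "Black-footed Albatross" (its parent must also be 1)
#  [1 0 1]    observe "Laysan Albatross"
#  [0 0 0]]   the all-zero assignment
```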

3.2.2 Probabilistic Classification Loss

We calculate the probabilistic classification loss from $C_{prob}$, where each sigmoid node corresponds to a class label in the tree hierarchy. Suppose the number of sigmoid nodes is $N$, and $\mathbf{y} = (y_1, \ldots, y_N)$ is the binary label vector representing an assignment of all labels. Given an input image $x$, the joint probability of all sigmoid nodes under the assignment $\mathbf{y}$ can be computed as:

$\tilde{p}(\mathbf{y} \mid x) = \big( \sum_{i=1}^{N} y_i\, s_i(x) \big) \prod_{(v_i, v_j) \in E} \phi(y_i, y_j)$   (1)

where $s_i(x)$ is the sigmoid output of the $i$-th label node, $\tilde{p}(\mathbf{y} \mid x)$ is the unnormalized probability, and $E = E_d \cup E_u$. $\phi(y_i, y_j)$ is the constraint defined in the tree hierarchy between any two labels in $\mathbf{y}$:

$\phi(y_i, y_j) = \begin{cases} 0, & \text{if } (y_i, y_j) \text{ violates the constraint of edge } (v_i, v_j) \\ 1, & \text{otherwise} \end{cases}$   (2)

The joint probability is then normalized by $Z(x)$, where $Z(x)$ is the partition function that sums over all legal assignments in the state space $\mathcal{S}_T$ of tree $T$:

$Z(x) = \sum_{\mathbf{y} \in \mathcal{S}_T} \tilde{p}(\mathbf{y} \mid x)$   (3)

If input image $x$ is observed at the $k$-th label in the tree hierarchy, i.e., $y_k = 1$, we can obtain the marginal probability of label $y_k$ by summing over all legal assignments that include $y_k = 1$:

$p(y_k = 1 \mid x) = \dfrac{1}{Z(x)} \sum_{\mathbf{y} \in \mathcal{S}_T,\, y_k = 1} \tilde{p}(\mathbf{y} \mid x)$   (4)

The marginal probability of a leaf label in tree $T$ relies on the sum of its ancestors' scores, because all of its ancestors must be 1 if the label of this leaf node takes the value 1, which enables the parents' scores to impact the descendants' decisions. On the other hand, the marginal probability of a parent label is marginalized over all possible states of its descendants, i.e., it aggregates the information from all its subclasses.

We propose to compute this marginalization via matrix multiplication. Suppose the network outputs from $C_{prob}$ form a matrix $O \in \mathbb{R}^{N \times B}$, where $N$ is the number of sigmoid nodes and $B$ stands for the batch size. Each column in $O$ is the output vector corresponding to a sample in the batch. The unnormalized joint probabilities can be computed as $\tilde{P} = S O \in \mathbb{R}^{(N+1) \times B}$, and the partition function can be calculated by summing each column of $\tilde{P}$. To obtain the marginal probability of the $b$-th sample labeled at the $k$-th node, we first search for the eligible rows in the $k$-th column of $S$ that satisfy $y_k = 1$, then we sum the corresponding elements in the $b$-th column of $\tilde{P}$, and finally we normalize the summation by dividing it by the $b$-th element of the partition function.
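
The sketch below follows this matrix formulation literally (shapes, variable names, and the batching are ours): $S$ is the legal-assignment matrix from Sec. 3.2.1, $O$ holds the sigmoid outputs, the column sums of $\tilde{P} = SO$ give the partition values, and the marginal probability of an observed label sums the rows whose assignments switch that label on.

```python
import torch


def marginal_probability(S, O, sample_idx, label_idx):
    """S: (N+1, N) binary matrix of legal assignments; O: (N, B) sigmoid outputs of C_prob.
    Returns the marginal probability that sample `sample_idx` carries label `label_idx`."""
    S = S.float()
    P_tilde = S @ O                      # (N+1, B): unnormalized joint probability of each legal assignment
    Z = P_tilde.sum(dim=0)               # (B,): partition function, one value per sample
    eligible = S[:, label_idx] == 1      # legal assignments in which the observed label is switched on
    return P_tilde[eligible, sample_idx].sum() / Z[sample_idx]


def prob_loss(S, O, observed):
    """Average negative log marginal likelihood over a batch; observed[b] is the label index of sample b."""
    losses = [-torch.log(marginal_probability(S, O, b, k)) for b, k in enumerate(observed)]
    return torch.stack(losses).mean()
```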

In the training process, the observed label can be at any level of the hierarchy, and we maximize the marginal likelihood of the observed ground truth label. Given training samples $\{(x_m, \mathbf{y}_m, k_m)\}_{m=1}^{M}$, where $\mathbf{y}_m$ is the complete ground truth label vector and $k_m$ is the index of the observed label, the probabilistic classification loss is defined as:

$\mathcal{L}_{prob} = -\dfrac{1}{M} \sum_{m=1}^{M} \log p(y_{k_m} = 1 \mid x_m)$   (5)

3.2.3 Combinatorial Loss

The multi-class cross-entropy loss $\mathcal{L}_{ce}$ is commonly used in FGVC to separate fine-grained categories. We add $\mathcal{L}_{ce}$ to our model to further increase the discriminative power on the fine-grained leaf classes. $\mathcal{L}_{ce}$ employs the softmax outputs of $C_{ce}$, in which each node corresponds to a fine-grained leaf label in the tree hierarchy. The softmax outputs imply mutually exclusive relations among the fine-grained classes, which is consistent with the exclusion constraint defined in the tree hierarchy. We combine $\mathcal{L}_{ce}$ with the probabilistic term as follows:

$\mathcal{L}(x_m) = -\log p(y_{k_m} = 1 \mid x_m) + \mathbb{1}[k_m \in V_{leaf}]\, \mathcal{L}_{ce}(x_m)$   (6)

where $V_{leaf}$ denotes the set of fine-grained leaf labels and $\mathbb{1}[\cdot]$ is the indicator function. Depending on whether $x_m$ is labeled at a fine-grained leaf category, the combined loss decides whether it needs to incorporate $\mathcal{L}_{ce}$ or not. Finally, the total loss on the training set is:

$\mathcal{L}_{total} = \dfrac{1}{M} \sum_{m=1}^{M} \mathcal{L}(x_m)$   (7)
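
A minimal sketch of the combinatorial loss follows, reusing marginal_probability from the sketch in Sec. 3.2.2; the leaf_of lookup, the unweighted addition of the cross-entropy term, and all names are our assumptions based on the description above.

```python
import torch
import torch.nn.functional as F


def combinatorial_loss(S, sig_out, leaf_logits, observed, leaf_of):
    """sig_out: (N, B) sigmoid outputs of C_prob; leaf_logits: (B, L) logits of C_ce;
    observed[b]: hierarchy index of the observed label of sample b;
    leaf_of[k]: leaf-class index of hierarchy node k, or -1 for internal nodes."""
    per_sample = []
    for b, k in enumerate(observed):
        loss = -torch.log(marginal_probability(S, sig_out, b, k))        # probabilistic term
        if leaf_of[k] >= 0:                                               # sample labeled at the leaf level
            target = torch.tensor([leaf_of[k]], device=leaf_logits.device)
            loss = loss + F.cross_entropy(leaf_logits[b:b + 1], target)   # add the cross-entropy term
        per_sample.append(loss)
    return torch.stack(per_sample).mean()                                 # total loss over the batch
```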

4 Experiments

4.1 Implementation Details

In all our experiments, we resize input images to a fixed size and train each experiment for 200 epochs. Random horizontal flipping and random cropping (random cropping for training and center cropping for testing) are applied for data augmentation. We adopt ResNet-50 pre-trained on ImageNet as our trunk network and use stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005 to optimize our model. The batch size is set to 8. The learning rates of the convolutional layers and the FC layers newly added for hierarchical interaction are initialized to 0.002 and adjusted by the cosine annealing strategy [18]. The learning rates of the trunk layers are kept at 1/10 of those of the newly added layers. The code will be made publicly accessible.
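
For reference, the optimizer described above could be set up as follows (a sketch; the parameter-group split by module and the annealing period of 200 epochs are our assumptions):

```python
import torch
import torch.nn as nn


def build_optimizer(trunk: nn.Module, new_layers: nn.Module):
    """trunk: the pretrained ResNet-50; new_layers: the hierarchical interaction module and output heads."""
    optimizer = torch.optim.SGD(
        [
            {"params": new_layers.parameters(), "lr": 0.002},   # newly added conv/FC layers
            {"params": trunk.parameters(), "lr": 0.002 / 10},   # trunk kept at 1/10 of that rate
        ],
        momentum=0.9,
        weight_decay=0.0005,
    )
    # Cosine annealing of the learning rates [18]; annealing over the 200 epochs is our assumption.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
    return optimizer, scheduler
```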

4.2 Datasets and Experimental Designs

We evaluate our proposed method on three widely used FGVC datasets, i.e., CUB-200-2011 [32], Aircraft [20], and Stanford Cars [16]. However, CUB-200-2011 and Stanford Cars only provide one fine-grained label for each image. To construct a taxonomy of label hierarchy for these two datasets, we follow the work of Chang et al. [4], who trace parent nodes in Wikipedia pages. CUB-200-2011 is the most widely used benchmark for FGVC; it covers 11,788 bird images that are re-organized into a three-level label hierarchy with 13 orders, 38 families, and 200 species. Aircraft contains 10,000 images organized in a three-level label hierarchy with 30 makers, 70 families, and 100 models. Stanford Cars contains 16,185 images of 196 classes of cars and is re-organized into a two-level label hierarchy with 9 car types and 196 specific models. We do not use any bounding box/part annotations in our experiments and adopt the official train and test splits for evaluation.

Besides assigning hierarchical multi-granularity labels to each image, our experimental design simulates the aforementioned situation in which samples are observed at different levels of the hierarchy. To imitate the lack of domain knowledge, we select 0%, 30%, 50%, 70%, and 90% of the samples from each fine-grained class in the training set and relabel them from their last-level fine-grained classes to their immediate parent classes. Considering the impact of image quality, we conduct another experiment by reducing the image resolution of the selected samples by a factor of 4 using nearest-neighbor interpolation after relabeling. The extreme case of 0% represents the conventional setting of HMC or fine-grained classification that exploits the label hierarchy. The other cases indicate that part of the samples are observed at internal levels of the tree hierarchy, while the rest carry the complete label hierarchy from the highest level down to the fine-grained leaf level. All images in the test set are tested with the complete label hierarchy.
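
The relabeling protocol can be sketched as follows (the dataset record format, field names, and random seed are ours; the resize call mirrors the factor-4 nearest-neighbor downsampling described above):

```python
import random
from collections import defaultdict

from PIL import Image


def relabel_to_parent(samples, proportion, reduce_resolution=False, seed=0):
    """samples: list of dicts with keys 'path', 'fine_label', 'parent_label' (training set only).
    Relabels `proportion` of every fine-grained class to its immediate parent class."""
    rng = random.Random(seed)
    for s in samples:
        s["observed_label"] = s["fine_label"]        # default: observed at the finest level
    by_class = defaultdict(list)
    for s in samples:
        by_class[s["fine_label"]].append(s)
    for group in by_class.values():
        for s in rng.sample(group, int(round(proportion * len(group)))):
            s["observed_label"] = s["parent_label"]  # drop the finest-level annotation
            if reduce_resolution:                    # degrade image quality of the relabeled sample
                img = Image.open(s["path"])
                s["image"] = img.resize((img.width // 4, img.height // 4), resample=Image.NEAREST)
    return samples
```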

4.3 Evaluation Metrics

To reasonably evaluate the performance of HMC on FGVC datasets, we employ two evaluation metrics. The first metric follows the convention of FGVC and uses the overall accuracy (OA). The output of an HMC model is a probability vector over all classes; respecting the hierarchical label structure, we take the class with the maximum probability within each hierarchical level as the predicted label for that level and compute OA on the test set. The second criterion, commonly used in the HMC literature [31, 33, 10], is the area under the average precision-recall curve, denoted $AU(\overline{PRC})$. Instead of calculating a precision-recall curve (PRC) for each class, $AU(\overline{PRC})$ computes an average PRC to evaluate the output probability vector of all classes in the hierarchy. Specifically, for a given threshold value, one point on the average PRC is computed as:

$\overline{\mathrm{Prec}} = \dfrac{\sum_i TP_i}{\sum_i TP_i + \sum_i FP_i}, \qquad \overline{\mathrm{Rec}} = \dfrac{\sum_i TP_i}{\sum_i TP_i + \sum_i FN_i}$   (8)

where $i$ ranges over all classes, and $TP_i$, $FP_i$, and $FN_i$ are the numbers of true positives, false positives, and false negatives for class label $i$, respectively. By varying the threshold, an average PRC is obtained, and $AU(\overline{PRC})$ denotes the area under this curve. $AU(\overline{PRC})$ also has the advantage of being independent of the threshold used to decide when a sample belongs to a particular class (which is often heavily application-dependent).
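
A minimal sketch of this metric, assuming the predicted probabilities and binary targets for all classes in the hierarchy are arranged as (num_samples, num_classes) arrays; the threshold grid and function name are ours:

```python
import numpy as np


def au_average_prc(scores, targets, thresholds=np.linspace(0.0, 1.0, 101)):
    """scores, targets: (num_samples, num_classes) arrays of predicted probabilities and {0,1} labels.
    Pools TP/FP/FN over all classes at each threshold, then integrates the average PR curve."""
    points = []
    for t in thresholds:
        pred = scores >= t
        tp = np.logical_and(pred, targets == 1).sum()
        fp = np.logical_and(pred, targets == 0).sum()
        fn = np.logical_and(~pred, targets == 1).sum()
        prec = tp / (tp + fp) if tp + fp > 0 else 1.0
        rec = tp / (tp + fn) if tp + fn > 0 else 0.0
        points.append((rec, prec))
    points.sort()                                   # order the points by recall before integrating
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points[:-1], points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0         # trapezoidal area under the average PR curve
    return area
```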

4.4 Ablation Study

In this section, we first conduct ablation studies on CUB-200-2011 to investigate the two key designs of the proposed method: HRN and the combinatorial loss. Then, we conduct visualization experiments by illustrating the attention regions of each hierarchical level using Grad-CAM [27] to demonstrate the roles of HRN and the combinatorial loss.

4.4.1 Significance of HRN

As displayed in Fig. 2, we analyze three components of HRN: the granularity-specific blocks (GSB), the linear combination of hierarchical features (LC), and the non-linear transformation of the combined features (ReLU). We report OA on the species level with a relabeling proportion of 0% in Tab. 1. The model that only contains ResNet-50 and the combinatorial loss obtains an OA of 84.32%. As more components of HRN are integrated into the model, we gradually achieve better results.

Component OA
Combinatorial Loss 84.32
Combinatorial Loss + GSB 85.77
Combinatorial Loss + GSB + LC 86.17
Combinatorial Loss + GSB + LC + ReLU 86.60
Table 1: OA on the species level with the relabeling proportion of 0% by gradually adding each component in HRN: granularity-specific block (GSB), the linear combination of hierarchical features (LC), and non-linear transformation of combined features (ReLU).

4.4.2 Contribution of Combinatorial Loss

In this subsection, we validate the effectiveness of combining the probabilistic classification loss ($\mathcal{L}_{prob}$) with the multi-class cross-entropy loss ($\mathcal{L}_{ce}$). Tab. 2 records OA on the species level under the five relabeling proportions. It can be found that when more training samples are relabeled to coarse-grained classes, the fine-grained classification performance of $\mathcal{L}_{prob}$ alone degenerates drastically. In contrast, the combinatorial loss, which adds $\mathcal{L}_{ce}$ imposed on the fine-grained leaf classes, consistently outperforms $\mathcal{L}_{prob}$ alone.

Relabeling 0% 30% 50% 70% 90%
$\mathcal{L}_{prob}$ 84.56 76.66 64.36 45.10 28.69
$\mathcal{L}_{prob}$ + $\mathcal{L}_{ce}$ 86.60 83.91 80.52 73.96 53.02
Table 2: OA on the species level, analyzing the effectiveness of combining the probabilistic classification loss ($\mathcal{L}_{prob}$) with the multi-class cross-entropy loss ($\mathcal{L}_{ce}$).

4.4.3 Visualization Experiments

We conduct visualization experiments to demonstrate that the granularity-specific blocks capture different regions of interest while hierarchical knowledge is transferred across levels via the hierarchical residual interaction and the supervision of the combinatorial loss. To this end, we adopt Grad-CAM to visualize the attention regions of each hierarchical level by propagating their respective gradients back to the feature maps generated from the trunk network. Fig. 3 illustrates two species of birds, House Wren and Marsh Wren, from the same family (Troglodytidae) and order (Passeriformes). Domain knowledge reveals that Marsh Wrens have a more distinctive eyebrow than House Wrens. In Fig. 3, the order level focuses on the head and legs, and the family level pays close attention to the head and the feathers on the wing and tail. In addition to the feathers, the species level concentrates on the head. The three hierarchical levels give similar attention to the head with its distinctive eyebrow, while each has its own unique regions of interest.

(a) House Wren
(b) Marsh Wren
Figure 3: Visual attention maps of two species of birds from the same family (Troglodytidae) and order (Passeriformes).
Relabeling Hierarchy HMC-LMLP [3] HMCN [33] C-HMCNN [10] Chang et al. [4] Ours
0% Order 98.45 0.945 97.29 0.934 98.48 0.960 97.76 0.968 98.67 0.969
Family 94.24 93.15 94.63 94.17 95.51
Specie 79.60 79.75 81.58 85.56 86.60
30% Order 98.17 0.920 96.82 0.905 97.98 0.938 97.81 0.962 98.31 0.958
Family 93.58 91.99 93.89 94.10 94.79
Specie 71.30 71.68 74.91 82.53 83.91
50% Order 98.36 0.895 96.70 0.874 98.34 0.909 97.43 0.951 97.89 0.944
Family 93.84 90.85 94.10 93.47 94.29
Specie 64.34 64.29 67.52 79.30 80.52
70% Order 98.27 0.831 97.22 0.834 98.02 0.844 96.65 0.924 98.43 0.936
Family 93.84 91.25 93.91 91.74 93.94
Specie 47.98 52.90 50.05 70.03 73.96
90% Order 98.38 0.716 97.31 0.725 98.27 0.772 97.12 0.868 97.97 0.865
Family 94.44 86.85 94.37 91.91 93.32
Specie 22.89 30.69 26.16 49.36 53.02
Table 3: OA (%) / $AU(\overline{PRC})$ results on CUB-200-2011 compared to state-of-the-art methods.
Relabeling Hierarchy HMC-LMLP [3] HMCN [33] C-HMCNN [10] Chang et al. [4] Ours
0% Maker 97.09 0.968 96.07 0.959 97.45 0.979 96.88 0.981 97.45 0.976
Family 94.39 92.56 95.41 95.28 95.79
Model 90.25 87.19 91.69 91.92 92.58
30% Maker 96.85 0.950 96.13 0.952 96.76 0.971 87.41 0.957 97.27 0.970
Family 93.34 92.74 94.27 94.44 95.52
Model 85.42 85.42 88.39 89.33 91.62
50% Maker 97.24 0.925 95.71 0.935 96.49 0.963 73.56 0.909 97.27 0.965
Family 93.82 92.05 93.88 94.17 95.67
Model 83.59 81.52 85.18 86.66 89.66
70% Maker 96.97 0.898 95.80 0.900 96.67 0.953 58.77 0.816 96.75 0.953
Family 93.70 90.49 94.00 93.78 94.20
Model 81.61 78.37 80.11 82.96 84.53
90% Maker 96.97 0.870 93.40 0.824 96.76 0.903 49.88 0.656 95.43 0.904
Family 93.37 89.50 94.36 93.72 91.68
Model 74.41 70.06 71.02 64.99 71.06
Table 4: OA (%) / $AU(\overline{PRC})$ results on Aircraft compared to state-of-the-art methods.

4.5 Comparison with State-of-the-art Methods

To fairly evaluate the proposed method, we compare it to state-of-the-art HMC methods, HMC-LMLP [3], HMCN [33], and C-HMCNN [10], and to the state-of-the-art FGVC approach that exploits the label hierarchy, Chang et al. [4]. In our hierarchical settings, we train all methods with the different relabeling proportions. Chang et al. [4] sum the multi-class cross-entropy losses of all hierarchical levels; when adapting their approach to our hierarchical settings, we omit the last-level loss if a sample has been relabeled to its parent class. We report the OA of each hierarchical level and the $AU(\overline{PRC})$ results on the test sets of the three FGVC datasets, CUB-200-2011, Aircraft, and Stanford Cars, in Tab. 3, Tab. 4, and Tab. 5, respectively.

From Tab. 3, Tab. 4, and Tab. 5, we can observe that the proposed method achieves the best OA results on each hierarchical level and the best $AU(\overline{PRC})$ results in most cases; in the remaining cases, our results are comparable to the best ones. Chang et al. [4] use level-specific classification heads to disentangle coarse-level features from fine-grained ones, but their loss function only considers mutual exclusion within each hierarchical level without examining subsumption relations between levels. C-HMCNN only constrains subsumption relations. HMC-LMLP and HMCN embed the label hierarchy in their network architectures and train with the binary cross-entropy loss, which implies that all classes are independent. By contrast, in our framework, the tree hierarchy specifies the relation between any two labels as either mutual exclusion or subsumption, and the corresponding probabilistic loss combined with the multi-class cross-entropy loss can transfer hierarchical knowledge during training. In addition, the proposed HRN disentangles hierarchical features with granularity-specific blocks, and these features interact via residual connections to fuse attributes following the hierarchy.

Relabeling Hierarchy HMC-LMLP [3] HMCN [33] C-HMCNN [10] Chang et al. [4] Ours
0% Type 96.98 0.953 95.21 0.938 96.75 0.971 96.40 0.977 97.41 0.981
Maker 87.65 88.71 90.64 93.65 94.03
30% Type 96.85 0.909 94.38 0.887 96.23 0.927 96.23 0.970 96.13 0.969
Maker 79.16 81.59 81.92 91.61 90.55
50% Type 96.92 0.842 93.46 0.832 95.95 0.850 95.60 0.960 95.88 0.963
Maker 66.45 73.03 70.22 88.10 88.72
70% Type 96.89 0.705 93.02 0.713 95.67 0.708 92.90 0.905 96.06 0.947
Maker 41.52 52.66 43.17 76.13 83.72
90% Type 96.38 0.572 93.42 0.560 96.49 0.577 92.25 0.761 94.32 0.794
Maker 13.51 19.89 13.54 45.79 49.30
Table 5: OA (%) / $AU(\overline{PRC})$ results on Stanford Cars compared to state-of-the-art methods.

4.6 Analyze State-of-the-art Methods by Reducing Image Resolution

Apart from limitations in domain knowledge, samples captured at low resolution can hardly be identified with the last-level fine-grained categories and are thus more likely to be annotated with upper-level coarse classes. Considering this practical limitation of image quality, we reduce the image resolution of the selected samples corresponding to the different relabeling proportions. Tab. 6 displays the experimental results; our method outperforms the compared methods in most cases under both evaluation metrics.

Relabeling Hierarchy HMC-LMLP [3] HMCN [33] C-HMCNN [10] Chang et al. [4] Ours
0% Order 98.45 0.945 97.29 0.934 98.48 0.960 97.76 0.968 98.67 0.969
Family 94.24 93.15 94.63 94.17 95.51
Specie 79.60 79.75 81.58 85.56 86.60
30% Order 97.86 0.926 96.32 0.887 97.81 0.944 97.62 0.961 98.50 0.959
Family 93.18 88.06 93.48 93.59 94.75
Specie 74.32 70.78 76.04 82.33 84.13
50% Order 97.45 0.907 95.32 0.853 97.62 0.925 97.12 0.947 98.20 0.952
Family 92.25 85.93 92.51 91.79 93.82
Specie 68.10 62.70 70.37 78.30 81.18
70% Order 97.62 0.862 94.43 0.789 97.20 0.881 96.32 0.909 97.58 0.926
Family 91.72 82.64 91.18 88.84 92.42
Specie 53.11 45.75 55.14 68.06 73.98
90% Order 96.67 0.695 93.56 0.694 96.79 0.801 96.06 0.843 96.15 0.837
Family 89.96 78.60 89.44 87.52 88.29
Specie 20.78 22.52 28.32 46.58 50.10
Table 6: OA (%) / $AU(\overline{PRC})$ results on CUB-200-2011 after reducing the image resolution of the relabeled samples.

4.7 Comparison with FGVC Methods Exploiting Hierarchical Knowledge

Considering hierarchical knowledge, FGVC approaches refine the feature representations related to hierarchical levels in the feature space [35, 28, 6, 4], e.g., by measuring the distance between classes in the hierarchy [35, 28], learning finer-grained features with the predictions of the higher level [6], or disentangling coarse-level features from fine-grained ones [4]. Nevertheless, they develop their loss functions based on the multi-class cross-entropy loss, which only implies mutual exclusion at the same hierarchical level, whereas encoding label relations like the parent-child correlation helps to utilize samples observed at different levels. In contrast, the proposed method specifies label relations with the tree hierarchy and computes the combinatorial loss to effectively exploit samples labeled at different levels.

We record the best results reported in their papers, where each sample has complete hierarchical multi-granularity labels. In Tab. 7, Chang et al. [4] achieve state-of-the-art performance on the traditional single-label FGVC problem. Our approach reaches comparable results by simply replacing ResNet-50 with ResNeXt-101 (32x4d) [34]. Other techniques that enrich the feature representation in the context of FGVC could be applied to further boost performance, which is beyond our scope.

Method CUB-200-2011 Aircraft Stanford Cars
Zhang et al. [35] 88.4
Shi et al. [28] 77.0 84.6 89.5
Chen et al. [6] 88.1
Chang et al. [4] 89.9 93.6 95.1
Ours 88.6 94.1 95.1
Table 7: OA on the species level by comparing to other FGVC approaches that exploit the label hierarchy.

5 Conclusion

We study the HMC problem in which different objects can be discerned at various levels of the label hierarchy due to differences in domain knowledge or image quality. To address this problem, we propose the combinatorial loss and HRN. The combinatorial loss combines the probabilistic classification loss, defined on a tree hierarchy that encodes the semantic relation between any two hierarchical labels, with the multi-class cross-entropy loss imposed on the fine-grained leaf categories. The probabilistic classification loss transfers hierarchical knowledge across levels, and the multi-class cross-entropy loss increases the discriminative power on the leaf classes. In addition, HRN performs hierarchical feature interaction via residual connections, i.e., features from parent levels acting as skip connections are added to features of child levels. Comprehensive experiments on three commonly used datasets demonstrate the effectiveness of the proposed method compared to state-of-the-art HMC and FGVC methods.

References

  • [1] Z. Barutcuoglu, R. E. Schapire, and O. G. Troyanskaya. Hierarchical multi-label prediction of gene function. BMC Bioinform., 22(7):830–836, 2006.
  • [2] R. Cerri, R.C. Barros, and A. C. P. L. F. de Carvalho. Hierarchical multi-label classification using local neural networks. J. Comput. Syst. Sci., 80(1):39–56, 2014.
  • [3] R. Cerri, R. C. Barros, A. C. P. L. F. de Carvalho, and Y. Jin. Reduction strategies for hierarchical multi-label classification in protein function prediction. BMC Bioinformat., 17(1):373, 2016.
  • [4] Dongliang Chang, Kaiyue Pang, Yixiao Zheng, Zhanyu Ma, Yi-Zhe Song, and Jun Guo. Your "flamingo" is my "bird": Fine-grained, or not. In CVPR, pages 11476–11485, 2021.
  • [5] B. Chen, X. Huang, L. Xiao, Z. Cai, and L. Jing. Hyperbolic interaction model for hierarchical multi-label classification. In AAAI, volume 34, pages 7496–7503, 2020.
  • [6] Tianshui Chen, Wenxi Wu, Yuefang Gao, Le Dong, Xiaonan Luo, and Liang Lin. Fine-grained representation learning and recognition by exploiting hierarchical semantic embedding. In ACM MM, pages 2023–2031, 2018.
  • [7] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam. Large-scale object classification using label relation graphs. In ECCV, pages 48–64, 2014.
  • [8] I. Dimitrovski, D. Kocev, S. Loskovska, and S. Dzeroski. Hierarchical annotation of medical images. Pattern Recognit., 44(10-11):2436–2449, 2011.
  • [9] I. Dimitrovski, D. Kocev, S. Loskovska, and S. Džeroski. Hierarchical classification of diatom images using ensembles of predictive clustering trees. Ecological Informat., 7(1):19–29, 2012.
  • [10] E. Giunchiglia and T. Lukasiewicz. Coherent hierarchical multi-label classification networks. In NeurIPS, volume 33, 2020.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645. Springer, 2016.
  • [12] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In ICCV, pages 1314–1324, 2019.
  • [13] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [14] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.
  • [15] C. N. Silla Jr and A. A. Freitas. A survey of hierarchical classification across different application domains. Data Min. Knowl. Discov., 22(1-2):31–72, 2011.
  • [16] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In CVPR workshops, pages 554–561, 2013.
  • [17] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. J. Mach. Learn. Res., 5(Apr):361–397, 2004.
  • [18] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • [19] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In ECCV, pages 116–131, 2018.
  • [20] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
  • [21] Y. Mao, J. Tian, J. Han, and X. Ren. Hierarchical text classification with reinforced label assignment. In EMNLP-IJCNLP, pages 445–455, 2019.
  • [22] A. Mayne and R. Perry. Hierarchically classifying documents with multiple labels. In Proc. IEEE Symp. Comput. Intell. Data Mining, pages 133–139. IEEE, 2009.
  • [23] Y. Meng, J. Shen, C. Zhang, and J. Han. Weakly-supervised hierarchical text classification. In AAAI, volume 33, pages 6826–6833, 2019.
  • [24] H. Peng, J. Li, Y. He, Y. Liu, M. Bao, L. Wang, Y. Song, and Q. Yang. Large-scale hierarchical text classification with recursively regularized deep graph-cnn. In WWW, pages 1063–1072, 2018.
  • [25] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor. Kernel-based learning of hierarchical multilabel classification models. J. Mach. Learn. Res., 7:1601–1626, 2006.
  • [26] L. Schietgat, C. Vens, J. Struyf, H. Blockeel, D. Kocev, and S. Džeroski. Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformat., 11(1):2, 2010.
  • [27] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
  • [28] Weiwei Shi, Yihong Gong, Xiaoyu Tao, De Cheng, and Nanning Zheng. Fine-grained image classification using modified dcnns trained by cascaded softmax and generalized large-margin losses. NeurIPS, 30(3):683–694, 2018.
  • [29] W. Huang et al. Hierarchical multi-label text classification: An attention-based recurrent network approach. In Proc. 28th ACM Int. Conf. Inf. Knowl. Manag., pages 1051–1060, 2019.
  • [30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
  • [31] C. Vens, J. Struyf, L. Schietgat, S. Džeroski, and H. Blockeel. Decision trees for hierarchical multi-label classification. Mach. Learn., 73(2):185–214, 2008.
  • [32] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • [33] J. Wehrmann, R. Cerri, and R.C. Barros. Hierarchical multi-label classification networks. In ICML, pages 5075–5084, 2018.
  • [34] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, pages 1492–1500, 2017.
  • [35] Xiaofan Zhang, Feng Zhou, Yuanqing Lin, and Shaoting Zhang. Embedding label structures for fine-grained feature representation. In CVPR, pages 1114–1123, 2016.
  • [36] Feng Zhou and Yuanqing Lin. Fine-grained image classification by exploring bipartite-graph labels. In CVPR, pages 1124–1133, 2016.