We propose techniques to incorporate coarse taxonomic labels to train image classifiers in fine-grained domains. Such labels can often be obtained with less effort for fine-grained domains such as the natural world, where categories are organized according to a biological taxonomy. On the Semi-iNat dataset consisting of 810 species across three Kingdoms, incorporating Phylum labels improves the Species-level classification accuracy by 6% when using ImageNet pre-trained models. Incorporating the hierarchical label structure with a state-of-the-art semi-supervised learning algorithm called FixMatch improves the performance further by 1.3%. The relative gains are larger when detailed labels such as Class or Order are provided, or when models are trained from scratch. However, we find that most methods are not robust to the presence of out-of-domain data from novel classes. We propose a technique to select relevant data from a large collection of unlabeled images guided by the hierarchy, which improves the robustness. Overall, our experiments show that semi-supervised learning with coarse taxonomic labels is practical for training classifiers in fine-grained domains.
Large labeled datasets have been the key to the success of deep networks for many tasks. However, labeling requires expertise and can be time-consuming, especially for fine-grained recognition tasks such as identifying the species of birds [Wah et al.(2011)Wah, Branson, Welinder, Perona, and Belongie] or variants of aircraft [Maji et al.(2013)Maji, Rahtu, Kannala, Blaschko, and Vedaldi]. In this work, we investigate if coarsely labeled datasets can be used to improve the performance of a target fine-grained recognition task. For example, in natural domains one can often obtain a large dataset with the same coarse labels as the target task through community-driven platforms such as iNaturalist [Van Horn et al.(2018)Van Horn, Mac Aodha, Song, Cui, Sun, Shepard, Adam, Perona, and Belongie]. Effectively incorporating them in a semi-supervised learning framework to improve performance could be a compelling alternative to existing few-shot learning approaches, which have been less effective in fine-grained domains [Su et al.(2021)Su, Cheng, and Maji, Cole et al.(2021)Cole, Yang, Wilber, Aodha, and Belongie].
We present an analysis on the Semi-iNat dataset [Su and Maji(2021)] that consists of images from 810 species spanning three Kingdoms and eight Phyla (Figure 1 and Table 1). The dataset contains: (i) a small set of images labeled at the species level (in-class), (ii) a large set (9×) of coarsely labeled images from the same species (in-class), and (iii) an even larger set (32×) of coarsely labeled images from novel species within the same taxonomy (out-of-class). At test time, the species classification accuracy is measured on novel images of the set of species within the labeled set (in-class). The dataset is fine-grained and long-tailed, posing challenges to existing approaches for semi-supervised learning.
For a consistent evaluation, we present results using a ResNet-50 network trained from scratch or pre-trained on ImageNet [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] using various semi-supervised and self-supervised learning approaches. For the supervised Baseline with ImageNet pre-training, incorporating a hierarchical loss at the Phylum level consisting of eight categories improves the Top-1 accuracy from 40.4% to 46.6% (Figure 1 and Table 2). This beats the gains using semi-supervised learning with FixMatch [Sohn et al.(2020)Sohn, Berthelot, Li, Zhang, Carlini, Cubuk, Kurakin, Zhang, and Raffel] alone, which obtains 44.1%. However, the gains are complementary and combining the two improves the performance to 47.9%. Coarse taxonomic labels are also useful when models are trained from scratch, improving the performance of the best method from 32.0% to 34.5% (Table 2). Figure 1 quantifies the gains obtained with the Baseline and FixMatch using supervision from the Kingdom level (3 categories) to the Species level (810 categories, full supervision). Kingdom and Phylum supervision provides consistent gains for both methods, and improves over the semi-supervised learning without the coarse labels. The benchmark is far from saturated as indicated by the performance of the fully-supervised Baseline at roughly 86%.
A common assumption in semi-supervised learning is that the weakly-labeled data belongs to the same set of classes as the target task. This is hard to guarantee in practice, as coarsely labeled datasets collected in the wild may contain novel classes. We find that the presence of this out-of-domain data leads to a drop in performance for nearly all approaches. For example, the performance of FixMatch drops from 47.9% to 41.1% (Table 3) when the larger set of images is included for semi-supervised learning. This is problematic because determining whether an image contains a novel class is significantly more challenging than obtaining a large pool of images with the same coarse labels. We find that prior work on importance weighting using a domain classifier (e.g., [Su et al.(2020)Su, Maji, and Hariharan]) is less effective, but using the hierarchy to exclude the novel categories leads to a small improvement in some cases.
In summary, we show that coarse labels improve supervised and semi-supervised learning in fine-grained natural taxonomies. In particular, we find that: (i) coarse labels can be incorporated in several state-of-the-art methods to boost their performance; (ii) the improvements are greater with more fine-grained labels; (iii) the presence of novel classes hurts performance, but this can be somewhat mitigated by techniques for detecting domain shifts. However, the marginal gains obtained in this setting highlight the difficulty of detecting novel classes in fine-grained domains. The code is available at https://github.com/cvl-umass/ssl-evaluation.
Semi-supervised learning aims to use weakly labeled data to improve model generalization. Self-training approaches proposed in early work [McLachlan(1975), Scudder(1965)] use the model's own predictions to generate labels. Their modern incarnations include pseudo-labeling [Lee(2013)], which uses confident predictions as target labels for unlabeled data. Such labels can also be added gradually as a form of curriculum learning [Bengio et al.(2009)Bengio, Louradour, Collobert, and Weston, Cascante-Bonilla et al.(2021)Cascante-Bonilla, Tan, Qi, and Ordonez] to reduce model drift. UPS [Rizve et al.(2021)Rizve, Duarte, Rawat, and Shah]
generalizes this idea and incorporates low-probability predictions as negative pseudo-labels for multi-label prediction tasks. Other methods use a combination of self-supervised and semi-supervised learning techniques [Zhai et al.(2019)Zhai, Oliver, Kolesnikov, and Beyer, Gidaris et al.(2019)Gidaris, Bursuc, Komodakis, Pérez, and Cord, Su et al.(2020)Su, Maji, and Hariharan], which is sometimes followed by an additional step where the model's predictions are used to train a "student model" using distillation [Xie et al.(2020b)Xie, Luong, Hovy, and Le, Yalniz et al.(2019)Yalniz, Jégou, Chen, Paluri, and Mahajan, Zoph et al.(2020)Zoph, Ghiasi, Lin, Cui, Liu, Cubuk, and Le, Chen et al.(2020b)Chen, Kornblith, Swersky, Norouzi, and Hinton]. Consistency-based approaches enforce the similarity of predictions between two augmentations of the same data, or between different models in an ensemble, as a form of regularization [Bachman et al.(2014)Bachman, Alsharif, and Precup, Rasmus et al.(2015)Rasmus, Berglund, Honkala, Valpola, and Raiko, Laine and Aila(2017), Sajjadi et al.(2016)Sajjadi, Javanmardi, and Tasdizen, Miyato et al.(2018)Miyato, Maeda, Koyama, and Ishii]. The role of augmentations has been explored in detail in techniques such as MixMatch [Berthelot et al.(2019)Berthelot, Carlini, Goodfellow, Papernot, Oliver, and Raffel], ReMixMatch [Berthelot et al.(2020)Berthelot, Carlini, Cubuk, Kurakin, Sohn, Zhang, and Raffel], FixMatch [Sohn et al.(2020)Sohn, Berthelot, Li, Zhang, Carlini, Cubuk, Kurakin, Zhang, and Raffel], and UDA [Xie et al.(2020a)Xie, Dai, Hovy, Luong, and Le], which combine geometric and photometric image augmentations with other techniques such as MixUp [Zhang et al.(2018)Zhang, Cisse, Dauphin, and Lopez-Paz].
The hierarchical structure of the label space can be used to improve classification performance in many ways [Ristin et al.(2015)Ristin, Gall, Guillaumin, and Van Gool, Taherkhani et al.(2019)Taherkhani, Kazemi, Dabouei, Dawson, and Nasrabadi, Guo et al.(2018)Guo, Liu, Bakker, Guo, and Lew]. A common approach is to frame the problem as a structured prediction task and use a graphical model that incorporates the label structure with standard Bayesian machinery. For example, Deng et al. [Deng et al.(2014)Deng, Ding, Jia, Frome, Murphy, Bengio, Li, Neven, and Adam] exploit the inclusion and exclusion relations among the labels to improve classification. YOLOv2 [Redmon and Farhadi(2017)] learns object detectors across a large set of categories by predicting labels on the taxonomy in a top-down manner, learning a conditional distribution of the leaves given the parent category. This approach parameterizes the learnable weights along the edges of the tree; we instead parameterize the weights along the leaves of the tree. Other works predict fine-grained labels by designing models that predict the labels [Yan et al.(2015)Yan, Zhang, Piramuthu, Jagadeesh, DeCoste, Di, and Yu, Zhu and Bain(2017)] or learn features [Wang et al.(2015)Wang, Shen, Shao, Zhang, Xue, and Zhang] across different levels in the hierarchy in a multi-task framework. The hierarchical label space can also be utilized to detect novel classes. For example, given coarse labels, Hsieh et al. [Hsieh et al.(2019)Hsieh, Xu, Niu, Lin, and Sugiyama] learn to assign fine-grained pseudo-labels by meta-learning, assuming that the classifier achieves the best performance when the missing labels are correctly recovered. Another application is zero-shot learning, where attributes of the novel classes are provided but there are no training images for the novel classes [Samplawski et al.(2020)Samplawski, Learned-Miller, Kwon, and Marlin, Lee et al.(2018)Lee, Lee, Min, Zhang, Shin, and Lee]
. Recently, coarse labels have been incorporated in contrastive learning to improve image retrieval [Touvron et al.(2021)Touvron, Sablayrolles, Douze, Cord, and Jégou] and few-shot learning [Bukchin et al.(2021)Bukchin, Schwartz, Saenko, Shahar, Feris, Giryes, and Karlinsky, Phoo and Hariharan(2021)]. Unlike prior work, we investigate if hierarchical labels can be used to improve semi-supervised learning, for example by constraining the label space of approaches such as FixMatch or Pseudo-Labeling.
We focus on a structured prediction task where the label space has a hierarchical structure: a tree-structured biological taxonomy with 7 levels corresponding to Kingdom, Phylum, Class, Order, Family, Genus, and Species. Denote by $y^{(l)}$ the label of an instance at level $l$. Thus, $y^{(1)}$ is the label at the Kingdom level, $y^{(2)}$ is the label at the Phylum level, and the leaf nodes of the tree correspond to Species-level labels denoted by $y^{(7)}$. Similarly, denote the label space at level $l$ as $\mathcal{Y}^{(l)}$. For example, the label space at the Species level has $|\mathcal{Y}^{(7)}| = 810$ categories, and at the Phylum level $|\mathcal{Y}^{(2)}| = 8$. Given the label at a level, we can infer the labels at all coarser levels using the tree structure, e.g., we can infer the Kingdom label given the Phylum label. The Semi-iNat [Su and Maji(2021)] dataset provides Species-level labels for a subset of images, but only coarse labels (e.g., Kingdom and Phylum) for a larger set of images (Table 1). Performance is measured as the accuracy at the Species level on novel images.
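The claim that coarser labels come "for free" from finer ones can be sketched with a toy parent-pointer map (pure Python; the taxon names are illustrative stand-ins, not the actual Semi-iNat categories):

```python
# Toy taxonomy as parent pointers: each label maps to its parent one
# level up the tree (Species -> Genus -> ... -> Kingdom).
PARENT = {
    "Bombus impatiens": "Bombus",   # Species -> Genus
    "Bombus": "Apidae",             # Genus   -> Family
    "Apidae": "Hymenoptera",        # Family  -> Order
    "Hymenoptera": "Insecta",       # Order   -> Class
    "Insecta": "Arthropoda",        # Class   -> Phylum
    "Arthropoda": "Animalia",       # Phylum  -> Kingdom
}

def ancestors(label):
    """All coarser labels implied by a label, ordered fine to coarse."""
    chain = []
    while label in PARENT:
        label = PARENT[label]
        chain.append(label)
    return chain
```

For example, `ancestors("Bombus impatiens")` walks up through Genus, Family, Order, Class, and Phylum to the Kingdom; because each node has a single parent, every fine label determines all of its coarse labels.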
We train a model to predict labels at the finest level, i.e., the Species level, and marginalize over the tree structure to obtain probabilities at any level. There are several alternatives to this scheme. For example, we could train separate heads for each level instead of just the leaves. However, this performed worse, obtaining 42.1% accuracy for a supervised model initialized from ImageNet, while our proposed method achieves 46.6% accuracy. We could also consider the parameterization proposed in YOLOv2 [Redmon and Farhadi(2017)], which models the conditional distribution of the leaves given the parent for each internal node in the tree. Both schemes require many more parameters. For example, there are 2041 edges in the taxonomy (summing over Table 1 right); thus, the edge parameterization of YOLOv2 would require 2041 weights (#edges) compared to 810 weights (#leaves). While the edge parameterization can handle arbitrary graphs, it offers no obvious advantage for tree-structured models. Having fewer weights may be preferable in the few-shot setting.
We consider the supervised cross-entropy loss at each level of the hierarchy. For labeled data, the model predicts probabilities $p(y^{(7)}\,|\,x)$ over the Species label space. For coarsely labeled data, say with a label at the Phylum level $y^{(2)}$, we first use the same model to predict probabilities at the Species level. We then apply a cross-entropy loss on the model's prediction at the Phylum level, obtained by summing the probabilities of all leaf nodes under each Phylum. The marginalization can be written as $p(y^{(2)}\,|\,x) = M\, p(y^{(7)}\,|\,x)$, where the predefined matrix $M \in \{0,1\}^{8 \times 810}$ encodes the edges between the Species and Phylum levels (the elements are 1 for the edges and 0 otherwise).
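A minimal sketch of this marginalization, with a toy 2×4 matrix standing in for the real 8×810 one:

```python
def marginalize(p_species, M):
    """p(phylum | x) = M @ p(species | x): sum the leaf probabilities
    under each internal node, where M[i][j] = 1 iff species j lies
    under phylum i."""
    return [sum(m * p for m, p in zip(row, p_species)) for row in M]

# Two phyla over four species: species 0-1 under phylum 0,
# species 2-3 under phylum 1.
M = [[1, 1, 0, 0],
     [0, 0, 1, 1]]
p_species = [0.5, 0.2, 0.2, 0.1]
p_phylum = marginalize(p_species, M)   # approximately [0.7, 0.3]
```

Because each species lies under exactly one Phylum, the columns of M each contain a single 1, so the result is a valid distribution over Phyla.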
During training, we sample labeled data from $D$ and coarsely labeled data from $\tilde{D}$ in each batch for stochastic gradient descent. For labeled data, we only add a supervised loss at the lowest level, which is the Species level. The complete hierarchical supervised loss is:

$\mathcal{L}_{sup} = \mathcal{L}^{(7)}(D) + \mathcal{L}^{(2)}(\tilde{D}),$  (1)
where $\mathcal{L}^{(l)}$ denotes the cross-entropy loss $\ell(p, y) = -\log p_y$ computed using the model's probabilities at level $l$. The first term is the loss for labeled data at the Species level, and the second term is the loss for coarsely labeled data at the Phylum level. The superscript on the loss represents the level of supervision for labeled and coarsely labeled data. In the ablation studies, we investigate the effect of using different levels of supervision. Note that we could add a supervised loss at all seven levels of the taxonomy. However, in our experiments we did not find it useful to add losses at levels coarser than the one at which supervision was provided, e.g., losses at the Genus level or higher when Species labels are provided. Our method can also be extended to general hierarchical graphs such as WordNet using marginalization methods [Deng et al.(2014)Deng, Ding, Jia, Frome, Murphy, Bengio, Li, Neven, and Adam, Samplawski et al.(2020)Samplawski, Learned-Miller, Kwon, and Marlin].
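The two-term loss can be sketched in pure Python under these definitions (the probability vectors and matrix below are toy values; a real implementation would operate on batched network outputs):

```python
import math

def cross_entropy(p, y):
    """ell(p, y) = -log p_y for a probability vector p and index y."""
    return -math.log(p[y])

def hierarchical_loss(p_fine, y_species, p_coarse, y_phylum, M):
    """Species-level CE on a labeled example plus Phylum-level CE on a
    coarsely labeled example, with Phylum probabilities obtained by
    marginalizing the leaves through the 0/1 matrix M."""
    loss_labeled = cross_entropy(p_fine, y_species)
    p_phylum = [sum(m * p for m, p in zip(row, p_coarse)) for row in M]
    loss_coarse = cross_entropy(p_phylum, y_phylum)
    return loss_labeled + loss_coarse
```

Note that the coarse term back-propagates through all leaf probabilities under the labeled Phylum, which is how Phylum labels shape the Species-level classifier.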
In addition to the hierarchical loss, we add semi-supervised losses such as consistency regularization, entropy minimization, or pseudo-labeling at the Species level for coarsely labeled data. We select representative semi-supervised methods, including Pseudo-Labeling, FixMatch, Self-Training with distillation, and self-supervised learning (MoCo) with distillation. We describe how we incorporate hierarchical supervision into these methods in the following.
Pseudo-Labeling uses the model's prediction as a label if the predicted probability is higher than a threshold $\tau$. Denoting the pseudo-label at the Species level as $\hat{y}^{(7)}$, the loss for pseudo-label training is:

$\mathcal{L}_{pl} = \mathcal{L}_{sup} + \sum_{x \in \tilde{D}} \mathbb{1}\big[\max_y p(y\,|\,x) > \tau\big]\; \ell\big(p(y^{(7)}\,|\,x),\, \hat{y}^{(7)}\big).$  (2)
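The thresholding rule can be sketched as follows (pure Python; the 0.95 threshold is illustrative, and unconfident examples simply contribute no loss):

```python
import math

def pseudo_label_loss(p, tau=0.95):
    """Treat the argmax as a hard label when the model is confident,
    and skip the example otherwise."""
    conf = max(p)
    if conf <= tau:
        return None                   # below threshold: no loss this step
    y_hat = p.index(conf)             # the pseudo-label
    return -math.log(p[y_hat])        # cross-entropy against it
```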
FixMatch utilizes two different augmentation functions for consistency training: a weak augmentation $\alpha$ and a strong augmentation $\mathcal{A}$. For each coarsely labeled image $x \in \tilde{D}$, the KL divergence between the pseudo-label from the weakly-augmented image $\alpha(x)$ and the prediction on the strongly-augmented image $\mathcal{A}(x)$ is minimized. To compute the supervised loss for labeled data, weak augmentation is used. For the supervised loss on coarsely labeled data, we use the strongly-augmented images, since weakly-augmented images are only used for generating pseudo-labels without back-propagation. The final loss is:

$\mathcal{L}_{fm} = \mathcal{L}^{(7)}(D) + \mathcal{L}^{(2)}(\tilde{D}) + \sum_{x \in \tilde{D}} \mathbb{1}\big[\max_y p(y\,|\,\alpha(x)) > \tau\big]\; \ell\big(p(y^{(7)}\,|\,\mathcal{A}(x)),\, \hat{y}^{(7)}\big).$  (3)
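The consistency term can be sketched with the model and augmentations as stand-in callables (in the actual method, weak is a flip-and-crop and strong is RandAugment; the function name and threshold value are illustrative):

```python
import math

def fixmatch_consistency(model, x, weak_aug, strong_aug, tau=0.95):
    """A pseudo-label from the weakly augmented view supervises the
    strongly augmented view; unconfident examples are skipped."""
    p_weak = model(weak_aug(x))        # no gradient here in a real implementation
    conf = max(p_weak)
    if conf <= tau:
        return 0.0
    y_hat = p_weak.index(conf)         # hard pseudo-label
    p_strong = model(strong_aug(x))
    return -math.log(p_strong[y_hat])  # CE of the strong view vs. pseudo-label
```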
We use model distillation [Hinton et al.(2015)Hinton, Vinyals, and Dean] as the self-training method. Specifically, we first train a teacher model on the labeled data $D$ using the supervised loss. We then train a student model using a distillation loss, the KL divergence between the temperature-softened outputs of the teacher and student models (with logits denoted $z_t$ and $z_s$). The final loss is:
$\mathcal{L}_{st} = \mathrm{KL}\big(\sigma(z_t / T)\,\|\,\sigma(z_s / T)\big),$  (4)

where $\sigma$ is the softmax function and $T$ is the temperature parameter.
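The distillation term can be sketched on single logit vectors (pure Python; the temperature value is illustrative):

```python
import math

def softmax(z, T=1.0):
    """Temperature-softened softmax of a logit vector."""
    e = [math.exp(v / T) for v in z]
    s = sum(e)
    return [v / s for v in e]

def distill_loss(teacher_logits, student_logits, T=4.0):
    """KL divergence between the softened teacher and student outputs."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

A higher temperature flattens both distributions, transferring more of the teacher's relative preferences among non-argmax classes to the student.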
We use Momentum Contrast (MoCo) [He et al.(2020)He, Fan, Wu, Xie, and Girshick] for self-supervised learning on the union of the labeled and coarsely labeled data ($D \cup \tilde{D}$). Specifically, MoCo uses contrastive learning to pull together the representations of two different augmentations of an image. Denote $q$ as the representation of an image and $k^+$ as the positive sample, which is another augmentation of the same image. The negative samples $k^-$ are drawn from a memory bank. The InfoNCE [Oord et al.(2018)Oord, Li, and Vinyals] loss for the query $q$ is:

$\mathcal{L}_q = -\log \dfrac{\exp(q \cdot k^+ / T)}{\exp(q \cdot k^+ / T) + \sum_{k^-} \exp(q \cdot k^- / T)},$  (5)
where $T$ is a temperature hyper-parameter. The encoder for the memory bank is updated as a momentum-based moving average of the query encoder to stabilize training. After the self-supervised pre-training is complete, we replace the MLP layers (after global pooling) with a linear projection layer and fine-tune the entire model using our supervised hierarchical loss (Eq. 1). Alternatives to MoCo such as SimCLR [Chen et al.(2020a)Chen, Kornblith, Norouzi, and Hinton] and BYOL [Grill et al.(2020)Grill, Strub, Altché, Tallec, Richemond, Buchatskaya, Doersch, Pires, Guo, Azar, et al.] could be used, but their effect is somewhat mitigated by the fact that ImageNet pre-training is still far more effective than self-supervised learning on fine-grained domains. However, the impact may be higher when training from scratch.
This method combines the previous two approaches, similar to the setting proposed by Chen et al. [Chen et al.(2020b)Chen, Kornblith, Swersky, Norouzi, and Hinton]. The MoCo pre-trained model is fine-tuned on the labeled set to obtain the teacher model, which is then used to self-train a student model using the distillation loss in Eq. 4.
We use the Semi-iNat dataset [Su and Maji(2021)] from the semi-supervised challenge [sem()] at the FGVC8 workshop. The dataset contains 810 in-class species and 1629 out-of-class species from three different Kingdoms. The fully labeled data $D$ come from the in-class species, while the coarsely labeled data are drawn from the in-class and out-of-class species, denoted $\tilde{D}_{in}$ and $\tilde{D}_{out}$ respectively. Table 1 shows the statistics of the dataset and Figure 1 (left) shows the taxonomy.
We use ResNet-50 [He et al.(2016)He, Zhang, Ren, and Sun] as our backbone model and an input size of 224×224 for all the experiments. For all the methods except MoCo and FixMatch, we use SGD with a momentum of 0.9 and train for 100k iterations (from scratch) or 50k iterations (from expert models). The batch size is 60 for training supervised baselines. For semi-supervised methods, we sample 30 images each from the labeled and coarsely labeled data for a total batch size of 60. The learning rate is searched over a range of values and the weight decay over two settings; the hyper-parameters are chosen using the validation set of Semi-iNat. This set is not included in the supervised loss but is used for training MoCo.
For MoCo, we follow the setting of MoCo v2 [Chen et al.(2020c)Chen, Fan, Girshick, and He] and use a batch of 2048 negative samples. Training uses a learning rate of 0.03 for 800 epochs when training from scratch, and 0.0003 for 200 epochs when training from expert models.
For FixMatch, we follow the original setting and use RandAugment [Cubuk et al.(2020)Cubuk, Zoph, Shlens, and Le] for augmentation. Due to hardware constraints, we were limited to a batch size of 32 for labeled data and 160 for coarsely labeled data when training with 4 GPUs. When training from scratch we use a learning rate of 0.03 for 200k iterations; when training from expert models we use a learning rate of 0.001 for 100k iterations. The confidence threshold $\tau$ for FixMatch is kept the same in all our experiments.
Table 2: Top-1 accuracy (%) without (w/o) and with (w/) Phylum supervision, using in-class coarsely labeled data.

| Method | from scratch w/o | from scratch w/ | from ImageNet w/o | from ImageNet w/ |
| --- | --- | --- | --- | --- |
| FixMatch [Sohn et al.(2020)Sohn, Berthelot, Li, Zhang, Carlini, Cubuk, Kurakin, Zhang, and Raffel] | 15.5 | 25.7 | 44.1 | 47.9 |
| MoCo [He et al.(2020)He, Fan, Wu, Xie, and Girshick] | 30.2 | 33.5 | 41.7 | 41.9 |
| MoCo + Self-Training [Su et al.(2021)Su, Cheng, and Maji] | 32.0 | 35.4 | 42.6 | 45.8 |
We first consider the setting where the coarsely labeled images are within the set of labeled species ($\tilde{D}_{in}$). In § 4.4 we analyze the effect and utility of adding images of novel species from the same coarse categories. Models are initialized randomly or from an ImageNet pre-trained model. For each setting, we evaluate the supervised Baseline and five semi-supervised learning methods. We then analyze the effect of adding the hierarchical loss.
Results are presented in Table 2. Adding a hierarchical loss gives almost a 10% improvement for FixMatch and 3% for all other methods when models are trained from scratch. When initialized with ImageNet pre-trained models, adding the hierarchical loss provides 2-6% improvements in top-1 accuracy, except for the self-supervised training method (MoCo). Figure 2 shows the confusion matrices at the Phylum level for models trained with and without hierarchical supervision. Hierarchical supervision noticeably reduces confusion among the four Phyla within the Animal (A) Kingdom (e.g., Arthropoda vs. Echinodermata), as well as within the Plant (P) Kingdom. The combined effect of hierarchical supervision and semi-supervised learning is an overall improvement from 40.4% to 47.9%.
In this section, we consider supervision across different levels of the taxonomy on top of FixMatch and the supervised baseline. For example, if we have labels at the Order level for the coarsely labeled data, i.e., $y^{(4)}$, then the hierarchical loss becomes $\mathcal{L}^{(7)}(D) + \mathcal{L}^{(4)}(\tilde{D})$. We use the ground-truth Species-level labels of the coarsely labeled data, which were released after the competition ended [sem()], for this analysis. As shown in Figure 1 (right), finer-grained labels improve performance, but they also require more annotation effort. The number of classes at each level of the taxonomy, shown in Table 1, provides a proxy for the annotation effort. Even incorporating labels at the Kingdom level (which only has three categories) leads to improvements, while Class-level supervision (29 categories) improves the performance of FixMatch from 44.1% (FixMatch + None) to 51.8% (FixMatch + Class). In these experiments, the coarsely labeled set on which the semi-supervised losses (Eq. 3) are constructed is kept the same. Interestingly, even when full supervision is provided, these semi-supervised losses remain useful and provide an improvement over the fully-supervised Baseline. This is perhaps not surprising, as previous works have noted that such losses improve few-shot learning [Zhai et al.(2019)Zhai, Oliver, Kolesnikov, and Beyer, Su et al.(2020)Su, Maji, and Hariharan, Gidaris et al.(2019)Gidaris, Bursuc, Komodakis, Pérez, and Cord].
Table 3: Top-1 accuracy (%) without (w/o) and with (w/) Phylum supervision, when out-of-class coarsely labeled data is included.

| Method | from scratch w/o | from scratch w/ | from ImageNet w/o | from ImageNet w/ |
| --- | --- | --- | --- | --- |
| FixMatch [Sohn et al.(2020)Sohn, Berthelot, Li, Zhang, Carlini, Cubuk, Kurakin, Zhang, and Raffel] | 11.0 | 21.1 | 38.5 | 41.1 |
| MoCo [He et al.(2020)He, Fan, Wu, Xie, and Girshick] | 31.8 | 29.4 | 40.8 | 39.3 |
| MoCo + Self-Training [Su et al.(2021)Su, Cheng, and Maji] | 32.9 | 35.4 | 41.5 | 42.6 |
Next, we consider the case where there is a domain shift, i.e., training with data from $\tilde{D}_{in} \cup \tilde{D}_{out}$. The results are shown in Table 3. Without the hierarchical loss, the performance of all semi-supervised methods drops except for MoCo. Adding the hierarchical loss improves all methods except MoCo, though the improvements are smaller than in Table 2. Surprisingly, when training from the ImageNet model, no semi-supervised method improves over the supervised Baseline trained with the hierarchical loss. In particular, FixMatch is less robust to the domain shift, echoing the findings in [Su et al.(2021)Su, Cheng, and Maji]. One potential reason is the large domain shift between $\tilde{D}_{out}$ and the labeled dataset: although the out-of-domain classes are drawn from the same Class in the taxonomy, their appearance can be significantly different. To alleviate the effect of the domain shift, we propose to filter the out-of-domain data by measuring the uncertainty of the model's predictions. This step is performed before any semi-supervised training. Similar to pseudo-labeling, we first use the baseline supervised model to generate predictions for the coarsely labeled images, then check whether the maximum probability is greater than a threshold $\tau$. Note that the set of out-of-class species remains unknown. We further check whether the predicted labels match the provided coarse labels at the Phylum level to filter out-of-domain images.
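The two filtering checks (confidence, and agreement between the predicted species and the provided coarse label) can be sketched as follows (pure Python; the threshold and the species-to-phylum mapping are illustrative):

```python
def filter_coarse_set(preds, coarse_labels, species_to_phylum, tau=0.9):
    """Keep an unlabeled image only if the supervised model is confident
    AND its predicted species falls under the image's provided Phylum."""
    keep = []
    for i, (p, phylum) in enumerate(zip(preds, coarse_labels)):
        conf = max(p)
        y_hat = p.index(conf)              # predicted species
        if conf > tau and species_to_phylum[y_hat] == phylum:
            keep.append(i)
    return keep
```

Images that fail either check are excluded from the coarsely labeled set before semi-supervised training begins.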
After filtering the out-of-domain data, we train FixMatch using the selected coarsely labeled images. Using supervision at the Phylum level obtains 42.0% accuracy, compared to 41.1% without any domain selection. This allows us to use uncurated data, though the performance is lower than with only in-domain data (47.9%). Our proposed approach is simple, and there is significant room for improvement using better out-of-domain detection techniques and better modeling of the taxonomy. Nevertheless, the problem is challenging in fine-grained domains.
We showed that coarse labels improve semi-supervised learning on fine-grained image classification tasks. Thus, collecting coarse labels might provide a practical way to train models on new domains where a vast number of images are readily available, but labeling effort is expensive. We also found that out-of-domain (or novel class) data leads to a drop in performance. Hierarchical labels help, but the task of selecting relevant images can be challenging in fine-grained domains. Our model based on uncertainty and filtering using coarse labels only provides modest improvements. Improved techniques for detecting out-of-domain data combined with taxonomy-aware user input could provide further benefits.
This project is supported in part by NSF #1749833 and was performed using high performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative.
Bengio, Louradour, Collobert, and Weston. Curriculum learning. In International Conference on Machine Learning (ICML), 2009.
Cascante-Bonilla, Tan, Qi, and Ordonez. Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning. In AAAI Conference on Artificial Intelligence (AAAI), 2021.
Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, 2013.
Lee, Lee, Min, Zhang, Shin, and Lee. Hierarchical novelty detection for visual object recognition. In Computer Vision and Pattern Recognition (CVPR), 2018.
Yan, Zhang, Piramuthu, Jagadeesh, DeCoste, Di, and Yu. HD-CNN: Hierarchical deep convolutional neural networks for large scale visual recognition. In International Conference on Computer Vision (ICCV), 2015.