With the emergence of research on deep Convolutional Neural Networks (CNNs), image recognition performance has progressed remarkably. This gain comes mainly from the use of large datasets, which can easily lead to significant cost, since labeling data often requires human labor.
Semi-supervised learning (SSL) mitigates the requirement for labeled data by utilizing unlabeled data. A common assumption, often made implicitly during the construction of SSL benchmark datasets, is that the labeled and unlabeled data share the same categories and that their class distributions are balanced. However, this assumption does not hold in many realistic scenarios. For example, the Semi-Supervised iNaturalist dataset (Semi-iNat) [18] has a long-tailed data distribution and, more challengingly, does not distinguish between in-class and out-of-class unlabeled data.
Fine-grained image recognition and supervised learning on imbalanced data have both been widely explored. The challenges of fine-grained recognition are mainly two-fold: discriminative region localization and fine-grained feature learning from those regions. Previous research has made impressive progress by introducing part-based recognition frameworks, which rely on labels to identify possible object regions and extract discriminative features from each region. For imbalanced data, it is commonly observed that trained models are biased towards majority classes, which have numerous examples, and away from minority classes, which have few. Various solutions have been proposed to alleviate this bias, such as re-sampling [2, 3] and re-weighting [7, 4]. All of these methods rely on labels to re-balance the biased model and assume no domain mismatch between the labeled and unlabeled data.
In contrast, semi-supervised fine-grained image recognition on class-imbalanced and domain-mismatched data has been understudied. Class imbalance and domain mismatch pose further challenges in SSL, where missing label information precludes both rebalancing the unlabeled set and distinguishing between in-class and out-of-class unlabeled data. SSL algorithms commonly leverage pseudo-labels for unlabeled data, generated by a model trained on labeled data. However, pseudo-labels become problematic when the initial model is trained on imbalanced and domain-shifted data and is therefore biased toward majority classes and out-of-class data: out-of-class unlabeled data mistaken for in-class categories add noise, and subsequent training on such biased pseudo-labels intensifies the bias and degrades model quality. Most existing SSL algorithms have not been thoroughly evaluated on class-imbalanced and domain-mismatched data. Moreover, they are all evaluated on standard SSL image recognition benchmarks [20, 1, 17, 22] rather than on fine-grained image recognition benchmarks.
In this work, we investigate SSL in the context of domain-shifted and imbalanced semi-supervised fine-grained recognition, as illustrated in Fig.1. We observe that the poor performance of existing SSL algorithms on the Semi-iNat dataset is mainly due to highly similar classes, class imbalance, and domain mismatch between the labeled and unlabeled data. We evaluate FixMatch [17], a representative SSL algorithm with state-of-the-art performance on balanced SSL benchmarks, on the Semi-iNat dataset. In addition to obtaining low overall accuracy on the balanced test set, the model lacks the ability to identify fine-grained features and introduces noisy data.
With this in mind, this paper introduces a bilateral-branch self-training framework (BiSTF), which trains on imbalanced data through a bilateral-branch structure and samples pseudo-labeled data, while maintaining the same data distribution, through a stochastic epoch update strategy. To improve the fine-grained learning ability of the model, BiSTF uses a backbone with an attention mechanism. Rather than updating the labeled set in every generation, we use a stochastic epoch update strategy, in which the update frequency increases as training progresses, to moderate the noise from out-of-class data. In addition, to prevent the model from becoming biased towards the majority classes, the proposed method samples unlabeled data with the same distribution as the labeled set and adds it to the labeled set to retrain an SSL model for the next generation.
We show in experiments that BiSTF improves over the baseline SSL method by a large margin. On the Semi-iNat dataset [18], our method outperforms FixMatch [17] by as much as 10.25% in accuracy. An extensive ablation study further demonstrates that our method particularly improves the ability to extract pseudo-labeled data from domain-shifted unlabeled data, making it a viable solution for class-imbalanced and domain-shifted semi-supervised fine-grained image recognition.
2 Related work
2.1 Semi-supervised learning
Many SSL methods share similar basic techniques, such as pseudo-labeling or consistency regularization, combined with deep learning. Pseudo-labeling [14, 17] uses labels predicted by the model itself as targets to train a classifier with unlabeled data. Consistency regularization [16, 13] learns a classifier by promoting consistent predictions across different views, using soft [16, 1, 13] or hard pseudo-labels. The performance of recent SSL methods depends on the quality of pseudo-labels. However, none of the above works has studied SSL on a class-imbalanced dataset, where model bias significantly threatens the quality of pseudo-labels.
2.2 Class-imbalanced supervised learning
Class-imbalanced learning has attracted increasing attention in the supervised setting. Prominent work includes re-sampling [5, 2] and re-weighting [11, 7], which adjust network training by rebalancing the contribution of each class so that it is, in expectation, closer to the test distribution, while other work focuses on re-weighting each instance [15]. These methods assume that the labels of all data fed into the model are available during training. Because label information is missing in the SSL scenario, their performance there is largely unknown.
2.3 Class-imbalanced semi-supervised learning
Although SSL has been extensively studied, it remains underexplored for class-imbalanced data. Recently, [24] showed that SSL and self-supervised learning can benefit class-imbalanced learning, and [10] proposed a method that suppresses the consistency loss for minority classes. Although these works study SSL under imbalanced data distributions, they address neither domain mismatch nor fine-grained recognition.
In this section, we first introduce Semi-iNat, a challenging dataset for semi-supervised recognition. Next, we set up the problem and introduce the supervised baseline used by the official Semi-iNat benchmark and by SSL algorithms. Then we investigate the misclassifications and biased behavior of existing SSL algorithms on Semi-iNat. Based on these observations, we propose a bilateral-branch self-training framework that fine-tunes the model with an in-class supervised branch to avoid noise from mistaken pseudo-labels, and that takes advantage of, rather than suffers from, the model’s bias to enhance performance on minority classes.
Unlike standard SSL image recognition benchmarks, the Semi-iNat dataset [18] poses several challenges for semi-supervised recognition: fine-grained categories, a long-tailed class distribution, and domain mismatch between labeled and unlabeled data.
Classes. Semi-iNat contains images of species from three kingdoms in the natural taxonomy: Animals, Plants, and Fungi (Tab.2).
Split. The dataset is relatively large, with a total of k images. Specifically, it is split into two sets, with 810 in-class species and with 1629 out-of-class species. For each species in , images are selected for the validation, public test, and private test sets. Among the rest of the images, around of the images are sampled as unlabeled data and the rest as labeled data . In addition, each class is guaranteed to have at least 5 labeled images. For species in , all of them are included in . The two sets of unlabeled data are then combined . More challengingly, no domain labels are provided, although coarse taxonomic labels such as kingdom and phylum are provided for the unlabeled data. The class distribution statistics are shown in Fig.1.
3.2 Problem setup and baselines
First of all, we set up the problem of domain-shifted and class-imbalanced semi-supervised fine-grained recognition.
Considering the labeled set , let denote a training sample and its corresponding label for a C-class recognition task. Without loss of generality, this paper denotes the number of training examples of class c in as , i.e., , and sorts the classes by cardinality in descending order, i.e., . Evidently, due to the long-tailed distribution, the marginal class distribution of is skewed, i.e., . We use the imbalance ratio to measure the degree of class imbalance, . Besides the labeled set , an unlabeled set that lacks label information and does not match the domain of the labeled set is also provided. Given sets and , our goal is to learn a classifier that generalizes well under the class-balanced public test and private test criteria.
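As a small illustration of the quantities in this setup, the sketch below computes per-class counts sorted in descending order and an imbalance ratio. It assumes the common long-tailed convention that the ratio is the largest class size divided by the smallest. The function name `imbalance_ratio` and the toy labels are hypothetical and not part of the paper.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return per-class counts in descending order and the max/min count ratio.

    Assumption: the imbalance ratio is (largest class size) / (smallest class size).
    """
    counts = sorted(Counter(labels).values(), reverse=True)
    return counts, counts[0] / counts[-1]


# Toy labeled set with three classes of 50, 20, and 5 images.
toy_labels = ["head"] * 50 + ["mid"] * 20 + ["tail"] * 5
toy_counts, toy_ratio = imbalance_ratio(toy_labels)
print(toy_counts, toy_ratio)
```

A ratio close to 1 indicates a nearly balanced labeled set, while a long-tailed set yields a much larger value.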
The official Semi-iNat baseline reports the result of a fully-supervised model trained on the labeled set using ResNet-50 [8] initialized from an ImageNet pre-trained model. This general recognition network is built with basic training strategies. Many existing state-of-the-art SSL methods leverage unlabeled data by assigning pseudo-labels from the classifier’s predictions. The classifier is then optimized on both the labeled samples and the selected unlabeled samples with high-confidence pseudo-labels. Therefore, the quality of pseudo-labels is crucial to the final performance of SSL algorithms. These SSL algorithms work well on standard class-balanced benchmarks because the quality of the classifier improves during training, owing to the addition of pseudo-labeled data.
However, when the classifier is biased due to the shifted domain and a skewed class distribution at the beginning of training, the online pseudo-labels of unlabeled data may be even more biased, which may further aggravate the domain mismatch and class imbalance issue and result in weakened performance on labeled set.
3.3 How does the baseline perform on Semi-iNat?
Instead of extending the protocol that uses various imbalance ratios to produce long-tailed versions of benchmark datasets such as CIFAR [12], retaining a fraction of the training data as labeled and the rest as unlabeled, we rely on the existing Semi-iNat split into train, validation, public test, and private test sets. We test FixMatch [17], one of the state-of-the-art SSL algorithms, which is designed for class-balanced data. By computing per-class validation recall and precision as well as test precision on Semi-iNat, we find that FixMatch offers only limited help for fine-grained image recognition on Semi-iNat.
As shown in Fig.2, the species Salvia nemorosa and Salvia tesquicola are both members of the genus Salvia. However, during training, an image of Salvia tesquicola is incorrectly classified as Salvia nemorosa by FixMatch, which shows that FixMatch learns fine-grained features poorly. When selecting pseudo-labels from unlabeled data, we found that FixMatch introduces noise from out-of-class data, so that in the next iteration the model is heavily biased towards out-of-class data. In addition, class imbalance reduces the quality of pseudo-labels, which also contributes to the poor performance on Semi-iNat. These empirical findings motivate us to improve the model’s ability to learn fine-grained features and to alleviate the impact of class imbalance and domain mismatch.
To achieve this goal, we introduce BiSTF, a bilateral-branch self-training framework for domain-shifted and class-imbalanced semi-supervised fine-grained recognition illustrated in Fig.3.
3.4 Bilateral-branch self-training framework
Self-training is a widely used iterative method in SSL. It trains the model for multiple generations, where each generation involves two steps: a supervised training step and a pseudo-label generation step.
As shown in Fig.3, BiSTF consists of the two main steps mentioned above. In the supervised training step, the model contains three main components. Concretely, we design two branches for in-class representation learning and semi-supervised classifier learning, termed the “in-class learning branch” and the “semi-rebalancing branch”, respectively.
Existing SSL algorithms usually ignore the subtle but discriminative features in fine-grained recognition, so a network structure with an attention mechanism is necessary. To meet the needs of fine-grained recognition, both branches use the same EfficientNet [19] network structure, whose SENet [9] modules capture fine-grained features more effectively than a residual network structure, and share all weights except for the last MBConv block.
For the bilateral branches, we apply the uniform and reversed samplers proposed in BBN [25] to the two branches separately and obtain two samples and as the input data, where is from the labeled set for the in-class learning branch and is from the union set of and sampled pseudo-labels generated by the next step for the semi-rebalancing branch. The uniform sampler retains the characteristics of the original distribution and therefore benefits representation learning. The reversed sampler, in contrast, aims to alleviate the extreme imbalance and in particular to improve recognition accuracy on minority classes. Crucially, the different samplers are the main difference between the two branches. The two samples are then fed into their corresponding branches, and the feature vectors and are obtained by global average pooling.
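The two samplers can be sketched as follows. This is a minimal illustration that assumes the reversed sampler draws a class with probability proportional to the inverse of its frequency (weight equal to the largest class count divided by the class count) and then draws uniformly within that class. The names `reversed_class_probs` and `draw_pair`, and the representation of samples as `(item, label)` tuples, are hypothetical.

```python
import random
from collections import defaultdict


def reversed_class_probs(class_counts):
    """Class sampling probabilities proportional to n_max / n_c (assumed form)."""
    n_max = max(class_counts.values())
    weights = {c: n_max / n for c, n in class_counts.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def draw_pair(labeled, mixed, rng):
    """Draw one uniform sample from `labeled` and one reversed sample from `mixed`.

    Both inputs are lists of (item, label) tuples; `mixed` stands for the labeled
    set extended with selected pseudo-labeled samples.
    """
    # Uniform sampler: every labeled sample is equally likely.
    uniform_sample = rng.choice(labeled)

    # Reversed sampler: pick a class with inverse-frequency probability,
    # then pick a sample uniformly inside that class.
    by_class = defaultdict(list)
    for item, label in mixed:
        by_class[label].append((item, label))
    probs = reversed_class_probs({c: len(v) for c, v in by_class.items()})
    classes = list(probs)
    chosen = rng.choices(classes, weights=[probs[c] for c in classes], k=1)[0]
    reversed_sample = rng.choice(by_class[chosen])
    return uniform_sample, reversed_sample


toy_probs = reversed_class_probs({"head": 100, "tail": 10})
print(toy_probs)
```

With 100 head-class and 10 tail-class samples, the tail class is drawn ten times as often as the head class, which is the opposite of what uniform sampling over samples would give.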
Furthermore, we adopt a cumulative learning strategy, as in BBN, to shift the model’s learning “attention” in the supervised training step. Different from BBN, the adaptive trade-off parameter directly affects the model’s bias towards in-class data and its re-balancing of the data by controlling the weights for and . The weighted feature vectors and are sent into the classifiers and , respectively, and the outputs are integrated by element-wise addition. The output logits are formulated as
where is the predicted output, i.e., . For each class , the probability of the class is calculated by the softmax function
Then, we denote the output probability distribution as , and as the cross-entropy loss function. Thus, our model generates a weighted cross-entropy recognition loss, which is formulated as
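To make the data flow of the supervised step concrete, the following sketch mixes the two branches in the way BBN does: features are weighted by a trade-off value and its complement, passed through per-branch classifiers, added element-wise, and scored with a cross-entropy loss weighted in the same way. This is an assumption based on the BBN formulation that the text refers to, not the paper's exact equations. The names `alpha`, `f_in`, `f_re`, `w_in`, and `w_re` are hypothetical, the classifiers are bias-free for brevity, and how the trade-off value is scheduled is not modeled here.

```python
import math


def linear(weights, features):
    """Bias-free linear classifier: one output logit per weight row."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bilateral_forward(alpha, f_in, f_re, w_in, w_re):
    """Weight each branch's features, classify, and add the logits element-wise."""
    z_in = linear(w_in, [alpha * v for v in f_in])
    z_re = linear(w_re, [(1.0 - alpha) * v for v in f_re])
    return softmax([a + b for a, b in zip(z_in, z_re)])


def weighted_cross_entropy(alpha, probs, y_in, y_re):
    """alpha * CE(probs, y_in) + (1 - alpha) * CE(probs, y_re), integer labels."""
    return -(alpha * math.log(probs[y_in]) + (1.0 - alpha) * math.log(probs[y_re]))


identity = [[1.0, 0.0], [0.0, 1.0]]
toy_probs = bilateral_forward(0.7, [1.0, 0.5], [0.2, 1.0], identity, identity)
toy_loss = weighted_cross_entropy(0.7, toy_probs, y_in=0, y_re=1)
print(toy_probs, toy_loss)
```

A trade-off value near 1 makes both the logits and the loss depend mostly on the in-class learning branch, while a value near 0 hands control to the semi-rebalancing branch.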
As observed in Sec.3.3, out-of-class unlabeled data are easily introduced in the early stage of training and interfere with the model. Therefore, instead of the per-iteration update strategy of FixMatch, we propose a “stochastic epoch update” strategy that is applied once the supervised training step has been trained thoroughly. Specifically, at the beginning of training, to keep out-of-class noise from affecting model performance, we perform pseudo-labeling on the unlabeled dataset with a small probability and selectively add the pseudo-labeled samples to training; the probability of an epoch update then gradually increases. We define whether to update or not as a flag .
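The update decision can be sketched as a Bernoulli draw whose success probability grows with the epoch. The sketch assumes a linear ramp between two endpoint probabilities; the function names and the defaults `p_start=0.0` and `p_end=1.0` are hypothetical choices for illustration, not values taken from the paper.

```python
import random


def update_probability(epoch, total_epochs, p_start=0.0, p_end=1.0):
    """Linearly ramp the update probability from p_start to p_end (assumed schedule)."""
    frac = epoch / max(total_epochs - 1, 1)
    return p_start + (p_end - p_start) * frac


def should_update(epoch, total_epochs, rng):
    """Return the update flag: True means pseudo-labels are refreshed this epoch."""
    return rng.random() < update_probability(epoch, total_epochs)


rng = random.Random(0)
flags = [should_update(e, 10, rng) for e in range(10)]
print(flags)
```

Early epochs rarely trigger a pseudo-label refresh, so few out-of-class samples enter training while the model is still unreliable; late epochs refresh almost every time.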
Pseudo-labeling uses the model obtained from the first step to generate artificial labels for unlabeled data. Specifically, when the largest class probability exceeds a predefined threshold, the corresponding hard label is retained as the pseudo-label. Letting , pseudo-labeling uses the following function:
where is the threshold. The pseudo-labeled set .
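The thresholding rule described above can be sketched as follows, assuming each unlabeled sample comes with a list of predicted class probabilities. The function names and the threshold of 0.95 used in the example are hypothetical illustration values.

```python
def pseudo_label(probs, threshold):
    """Return the arg-max class if its probability exceeds the threshold, else None."""
    confidence = max(probs)
    if confidence > threshold:
        return probs.index(confidence)
    return None


def build_pseudo_labeled_set(unlabeled_probs, threshold):
    """Keep (sample index, hard label) pairs for confident predictions only."""
    selected = []
    for index, probs in enumerate(unlabeled_probs):
        label = pseudo_label(probs, threshold)
        if label is not None:
            selected.append((index, label))
    return selected


toy_predictions = [
    [0.02, 0.97, 0.01],  # confident: kept with label 1
    [0.40, 0.35, 0.25],  # uncertain: dropped
    [0.96, 0.03, 0.01],  # confident: kept with label 0
]
toy_selected = build_pseudo_labeled_set(toy_predictions, threshold=0.95)
print(toy_selected)
```

Samples whose prediction is spread across several classes, which is typical for out-of-class images, are dropped rather than assigned a label.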
To accommodate class imbalance, this paper proposes a “contain same distribution” strategy that expands the labeled set with a selected subset , i.e., , rather than with all samples in . Biased pseudo-labels generated by an initial model trained on imbalanced data would otherwise intensify the bias. Consequently, during selection we keep the distribution of the selected pseudo-labeled data consistent with the in-class data distribution, which prevents the data distribution of from gradually becoming biased towards majority classes. For the next generation, the labeled set and the union set are fed into the “in-class learning branch” and the “semi-rebalancing branch”, respectively.
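One possible way to realize such a selection is sketched below; it is an assumption, since the text does not spell out the exact procedure. Each class that has pseudo-labeled candidates receives a quota proportional to its labeled count, the common scale is set by the scarcest class so that no quota exceeds the available candidates, and the most confident candidates fill each quota. The function name and the `(sample_id, pseudo_label, confidence)` tuple layout are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def select_same_distribution(labeled_labels, candidates):
    """Select pseudo-labeled samples whose class proportions follow the labeled set.

    `candidates` is a list of (sample_id, pseudo_label, confidence) tuples.
    Classes without any candidate are skipped (a simplifying assumption).
    """
    labeled_counts = Counter(labeled_labels)
    by_class = defaultdict(list)
    for sample_id, label, confidence in candidates:
        if label in labeled_counts:
            by_class[label].append((confidence, sample_id))
    if not by_class:
        return []

    # Largest common scale such that no class quota exceeds its candidate count.
    scale = min(Fraction(len(items), labeled_counts[c]) for c, items in by_class.items())

    selected = []
    for c, items in by_class.items():
        quota = int(scale * labeled_counts[c])
        items.sort(key=lambda pair: pair[0], reverse=True)  # most confident first
        selected.extend((sample_id, c) for _, sample_id in items[:quota])
    return selected


toy_labeled = ["a"] * 4 + ["b"] * 2
toy_candidates = [
    (0, "a", 0.99), (1, "a", 0.91), (2, "a", 0.97),
    (3, "a", 0.93), (4, "a", 0.92), (5, "a", 0.96),
    (6, "b", 0.95),
]
toy_selected = select_same_distribution(toy_labeled, toy_candidates)
print(sorted(toy_selected))
```

In the toy example the labeled set has a 2:1 ratio between classes `a` and `b`, so only the two most confident `a` candidates are kept alongside the single `b` candidate, even though six `a` candidates pass the threshold.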
4.1 Datasets and empirical settings
Semi-iNat. In addition to the description of Semi-iNat in Sec.3.1, a parameter that is critical to our experiments is the maximum imbalance ratio, with a value of . In this paper, the official train, validation, public test, and private test splits are used for fair comparison.
4.2 Implementation details
For a fair comparison, we use the ResNet-50 [8] adopted by the official baseline as our backbone network and train it with standard mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of . Our experiments follow the data augmentation strategies proposed in FixMatch [17]: resizing the image to , random horizontal flip with probability , random resized crop of a patch from the original image or its horizontal flip with scale from 0.2 to 1.0 and aspect ratio from 0.75 to 1.33, and RandAugment [6] with the same settings as in FixMatch. We train all models on a single NVIDIA A100 GPU with a batch size of 64. The initial learning rate is set to 0.01, and the learning rate is subsequently decayed by a ReduceLROnPlateau scheduler with a patience of 5.
4.3 Main results
First, we compare our model with FixMatch and with the baseline, which we reproduced following the official setup, and present the results in Tab.2. Owing to data augmentation, the accuracy of our reproduced baseline is 6.21 higher than the official result, which is listed for reference only. Although FixMatch performs reasonably well on the public and private test sets, its improvement is less pronounced than on basic SSL benchmarks. In contrast, BiSTF improves on the accuracy of FixMatch, achieving an absolute performance gain of as much as 1.79.
Table 2 (columns: Val, Public test, Private test)
We also observe that our model works particularly well, achieving accuracy gains of 1.75 and 1.79 on the public test and private test data, respectively. We hypothesize that, with the stochastic epoch update strategy, our model finds more correctly pseudo-labeled samples to augment the labeled set than with the per-iteration update strategy of FixMatch. In addition, the performance on the validation set indicates that BiSTF has an improved ability to learn fine-grained features.
We further report the performance of BiSTF with different backbones and image sizes in Tab.3. After resizing images to , we first directly evaluate several common backbones, including ResNet-101 [8], ResNeXt-101 [23], and EfficientNet-b5 to b7 [19]. All of these backbones boost performance further, yielding absolute accuracy improvements of 5.0 to 10.25 over BiSTF with a ResNet-50 backbone. Introducing noisy student [22] to EfficientNet improves the results of BiSTF further. Among these backbones, EfficientNetb7ns achieves the best performance, so we take it as the final baseline. Finally, applying 5-fold cross-validation after adding the validation data to the training data gives further accuracy gains and produces the best results.
Table 3 (columns: Val, Public test, Private test)
4.4 Ablation studies
We perform an extensive ablation study to evaluate and understand the contribution of each critical component in BiSTF. The experiments in this section are all performed with BiSTF on Semi-iNat.
Effect of update probability. BiSTF introduces the “stochastic epoch update” strategy, which controls the update frequency. Fig.4 shows how the update strategy influences performance over generations.
When P = 1 throughout training, our method updates the dataset in every generation. Two other update strategies are also tested, in which the update probability varies with the epoch linearly and in separate stages, respectively. To show the source of the accuracy improvements, Tab.4 presents accuracy on the validation set of Semi-iNat. The results suggest that the choice of stochastic update strategy affects the learning ability of the model to some extent by changing the pseudo-labeled data sampled for training, and that the linear strategy produces the best result.
Table 4 (columns: Stochastic epoch update strategy, Val)
In this work, we present a bilateral-branch self-training framework, named BiSTF, for domain-shifted and imbalanced semi-supervised fine-grained recognition. BiSTF is motivated by the observation that existing SSL algorithms, in addition to ignoring subtle but discriminative features, are vulnerable to class imbalance and domain mismatch. BiSTF iteratively refines a baseline SSL model with a labeled set expanded by pseudo-labeled samples from an unlabeled set, where the pseudo-labeled samples follow the same data distribution as the labeled dataset. Over generations of self-training, the model becomes less biased towards majority classes and out-of-class data and focuses more on in-class data. Extensive experiments on the Semi-iNat dataset demonstrate that the proposed BiSTF outperforms the existing state-of-the-art SSL algorithm.
References
[1] (2019) Mixmatch: a holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249.
[2] A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks 106, pp. 249–259.
[3] What is the effect of importance weighting in deep learning? International Conference on Machine Learning, pp. 872–881.
[4] (2019) Learning imbalanced datasets with label-distribution-aware margin loss. arXiv preprint arXiv:1906.07413.
[5] SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, pp. 321–357.
[6] (2020) Randaugment: practical automated data augmentation with a reduced search space. pp. 702–703.
[7] (2019) Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9268–9277.
[8] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[9] (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
[10] (2020) Class-imbalanced semi-supervised learning. arXiv preprint arXiv:2002.06815.
[11] Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems 29 (8), pp. 3573–3587.
[12] (2009) Learning multiple layers of features from tiny images.
[13] (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
[14] (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3.
[15] (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
[16] (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 1979–1993.
[17] (2020) Fixmatch: simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.
[18] (2021) The semi-supervised inaturalist challenge at the fgvc8 workshop. arXiv preprint arXiv:2106.01364.
[19] (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114.
[20] (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780.
[21] (2019) Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.
[22] Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698.
[23] (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.
[24] (2020) Rethinking the value of labels for improving class-imbalanced learning. arXiv preprint arXiv:2006.07529.
[25] (2020) Bbn: bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9719–9728.