In the past few years, labeled image datasets have played a critical role in computer vision tasks (Shu et al., 2019; Luo et al., 2019; Tang et al., 2017; Chen et al., 2020; Lu et al., 2020). To distinguish the subtle differences among fine-grained categories (e.g., birds (Wah et al., 2011), airplanes (Maji et al., 2013), or plants (Van Horn et al., 2018)), a large number of well-labeled images is typically required. However, labeling objects at the subordinate level generally requires domain-specific expert knowledge, which is not always available to human annotators on crowd-sourcing platforms such as Amazon Mechanical Turk (Xie et al., 2019, 2020).
To reduce the cost of fine-grained annotation, many methods have been proposed, most of which follow a semi-supervised learning paradigm (Cui et al., 2016; Niu et al., 2018; Xu et al., 2016; Yao et al., 2017). For example, Xu et al. (Xu et al., 2016) proposed to utilize detailed annotations and transfer as much knowledge as possible from existing strongly supervised datasets to weakly supervised web images for fine-grained recognition. Niu et al. (Niu et al., 2018) proposed a new learning scenario which only requires experts to label a few fine-grained subcategories and can predict all the remaining subcategories by virtue of web data. Nevertheless, semi-supervised methods involve various forms of human intervention and have relatively limited scalability.
To further reduce the demand for manual annotation, leveraging web images to train FGVC models has attracted broad attention (Zhang et al., 2020a; Yao et al., 2020; Yao et al., 2018a, 2020). Owing to error-prone automatic tagging systems or non-expert annotation, web images for fine-grained categories are usually associated with massive label noise. Therefore, along with how to find discriminative regions in fine-grained images, how to handle label noise is another pivotal problem for training deep FGVC models with web images, which is also the focus of this paper. Statistical learning has contributed significantly to coping with label noise, especially in theoretical aspects (Yao et al., 2018b). However, in this work, we mainly focus on deep learning based methods.
One typical method, as illustrated in Fig. 1 (a), is using “loss correction” to correct the loss of training samples based on the estimated noise transition matrix (Reed et al., 2014; Patrini et al., 2017). The problem is that it is extremely difficult to get an accurate estimation of the noise transition matrix, thus inevitable false correction will lead to severe accumulated errors in the training process. Alternatively, as described in Fig. 1 (b), another popular training schema endeavors to adopt “sample selection” to identify and remove samples with label noise, and only use clean samples to update the networks (Malach and Shalev-Shwartz, 2017; Jiang et al., 2017; Han et al., 2018). Despite promising results that have been achieved in these approaches, the loss-based selection strategy favors “easy” examples. “Hard” and mislabeled samples (e.g., a “Laysan Albatross” image is labeled as “Black Footed Albatross”) are ignored although they are surprisingly beneficial in making FG models more robust.
As shown in Fig. 1 (c), our idea is to re-utilize informative images (“hard” and mislabeled examples) by selecting and correcting employable instances from high-loss samples. To be specific, we first split samples via the loss-based criterion. Low-loss instances are deemed to be clean, “easy” examples and their labels remain unaltered. Then, we perform a further partition on high-loss samples. In this separation process, we follow a simple but intuitive observation: since images of irrelevant categories do not belong to any category involved in the task, the network tends to be more confused when predicting their label probabilities. On the contrary, “hard” and mislabeled ones tend to obtain a relatively more certain prediction. After hard and mislabeled samples are selected out of the high-loss instances, we correct their labels and then feed them together with clean samples into the network for updating parameters. Extensive experiments and ablation studies on tasks of fine-grained image categorization demonstrate the superiority of our proposed approach over existing webly supervised state-of-the-art methods. The primary contributions of this work can be summarized as follows:
(1) A webly supervised deep model CRSSC is proposed to bridge the gap between FGVC tasks and numerous web images. Comprehensive experiments demonstrate that our approach outperforms existing state-of-the-art webly supervised methods by a large margin.
(2) Three types of FG web images (i.e., clean, reusable, and irrelevant) which inherently exist in collected web images are successfully identified and then separated by CRSSC. Compared with existing methods, our approach can further leverage reusable samples to boost the model learning.
(3) A novel label correction method which utilizes the prediction history of the network is proposed to re-label the reusable samples. Our experiments show that our label correction method is better than the existing one which uses the current epoch prediction results.
2. Related Work
Fine-grained Visual Classification Fine-grained visual classification (FGVC) aims to distinguish similar subcategories belonging to the same basic category. Generally, existing approaches can be roughly grouped into three categories: 1) strongly supervised methods, 2) weakly supervised methods, and 3) semi-supervised methods. Strongly supervised methods tend to require not only image-level labels but also manually annotated bounding boxes or part annotations (Branson et al., 2014; Huang et al., 2016; Wei et al., 2018). Different from strongly supervised methods, weakly supervised methods cease to use bounding boxes and part annotations. Instead, methods in this group only require image-level labels during training (Lin et al., 2017b; Lin and Maji, 2017; Gao et al., 2016; Kong and Fowlkes, 2017; Cui et al., 2017; Li et al., 2018; Dubey et al., 2018; Zheng et al., 2019; Chen et al., 2019; Ge et al., 2019). The third group involves leveraging web images in training the FGVC model (Cui et al., 2016; Niu et al., 2018; Xu et al., 2016). However, these approaches still contain a certain level of human intervention, making them not purely web-supervised.
Webly Supervised Learning Since learning directly from web images requires no human annotation, this learning scenario is becoming popular (Hua et al., 2016; Zhang et al., 2018, 2020b; Yao et al., 2018a; Sun et al., 2019; Zhang et al., 2017b, 2016). However, training deep FGVC models directly with web images usually leads to poor performance due to the existence of label noise and memorization effects (Zhang et al., 2017a; Arpit et al., 2017) of neural networks. Existing deep methods for overcoming label noise can be categorized into two sets (Song et al., 2019): 1) loss correction and 2) sample selection. Loss correction methods choose to correct the loss of training samples based on an estimated noise transition matrix (Reed et al., 2014; Patrini et al., 2017; Goldberger and Ben-Reuven, 2017; Chang et al., 2017a; Ren et al., 2018; Yi and Wu, 2019). However, due to the difficulty in accurately estimating the noise transition matrix, accumulated errors induced by false correction are inevitable (Jiang et al., 2017; Han et al., 2018). Sample selection methods identify clean samples out of mini-batches based on their losses and leverage them to update the network (Malach and Shalev-Shwartz, 2017; Jiang et al., 2017; Han et al., 2018; Kumar et al., 2010; Song et al., 2019). Nevertheless, loss-based selection methods would cause the domination of easy samples in the training procedure while hard ones get ignored substantially (Chang et al., 2017a; Shrivastava et al., 2016; Lin et al., 2017a; Song et al., 2019). Our proposed CRSSC can leverage additional reusable samples to boost the deep FGVC models.
3. The Proposed Approach
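Fig. 2 presents the architecture of our proposed model. Generally, we can train a deep FGVC model through a well-labeled dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, in which $x_i$ is the $i$-th training sample and $y_i$ is the corresponding ground-truth label. In the conventional training schema, the model parameters are updated by optimizing a cross-entropy loss as follows:
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i \mid x_i; \theta^{t}) \qquad (1)$$
where $\theta^{t}$ is the network parameter in the $t$-th training epoch and $p(\cdot \mid x_i; \theta^{t})$ is the network output of sample $x_i$.
However, for web images $D$, reliable labels are not always available and they are usually associated with label noise. Then we can divide $D$ into three subsets:
$$D = D_{\text{noisy}} \cup D_{\text{mislabeled}} \cup D_{\text{clean}} \qquad (2)$$
where $D_{\text{noisy}}$ indicates noisy samples, $D_{\text{mislabeled}}$ represents mislabeled ones, and $D_{\text{clean}}$ stands for the clean set. More specifically, $D_{\text{clean}}$ can be further separated into the easy example set $D_{\text{easy}}$ and the hard example set $D_{\text{hard}}$:
$$D_{\text{clean}} = D_{\text{easy}} \cup D_{\text{hard}} \qquad (3)$$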
It should be noted that, as the training proceeds, hard samples will gradually become “easy”. Thus, the split of the clean set changes in the training process. In this work, we aim to train a robust deep FGVC model through noisy web data. Our main idea is to properly select and then re-label informative training samples for boosting the robustness of the FGVC model. Based on the division described in Eq. (2) and (3), we regard the union of the clean set and the mislabeled set as the informative training set, though samples in the mislabeled set have to be corrected before being fed into the model for further network optimization.
3.2. Drop and Reuse
To optimize the network with only useful knowledge from the web images, we have to 1) eliminate the negative influence of samples that belong to irrelevant categories, and 2) reduce the misleading impact of mislabeled samples. Therefore, two key challenges of tackling label noise in web images are: 1) how to separate informative samples from irrelevant ones and prevent the network from learning the latter, and 2) how to correct the labels of mislabeled samples and reuse them as part of the informative knowledge.
Memorization effects (Arpit et al., 2017; Zhang et al., 2017a) indicate that, on noisy datasets, CNNs tend to first learn clean and easy patterns in initial epochs. As the number of epochs increases, CNNs will eventually overfit on noisy samples. Our key idea is to drop these noisy instances before they are memorized. A widely used sample selection strategy is to separate instances based on losses (Jiang et al., 2017; Han et al., 2018). These methods typically select a human-defined proportion of low-loss instances as clean samples and directly drop the rest. Although these works achieve significant improvements in dealing with label noise, the way they separate instances leads to the mistaken deletion of samples belonging to the hard example set, as hard examples also tend to produce high losses. Moreover, mislabeled samples are also dropped by these sample selection methods due to their high losses.
To tackle the drawbacks of these loss-based sample selection methods, we design a drop and reuse mechanism to further select useful instances from high-loss samples. Furthermore, to avoid relying on a human-defined noise rate, we modify the conventional loss-based sample selection as in Definition 3.1. By adopting our proposed loss-based drop module together with the reuse module, we can effectively avoid mistakenly deleting hard examples and can also make full use of mislabeled samples.
Definition 3.1.
In a mini-batch, a sample is regarded as a low-loss (clean, easy) instance only if its loss is below the average loss of the samples in that mini-batch.
We use the average loss as the selection threshold for dynamically separating informative samples from irrelevant ones. Specifically, due to limited robustness in initial epochs, more samples tend to have high losses. The selection threshold will be large and CNNs will learn easy patterns from as many samples as possible. As the training proceeds, CNNs gain more robust ability and more samples tend to have low losses. The selection threshold will be small and more samples will be dropped. In this situation, CNNs will discard as many samples as possible for ensuring the data learned by CNN is informative. In this way, our method can reduce the negative impact of error accumulation.
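The average-loss threshold described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `split_by_mean_loss` is hypothetical, the per-sample losses are assumed to be already computed, and treating a loss exactly equal to the mean as low-loss is an assumption.

```python
from typing import List, Tuple


def split_by_mean_loss(losses: List[float]) -> Tuple[List[int], List[int]]:
    """Split the sample indices of one mini-batch by the mini-batch mean loss.

    Returns (low_loss_indices, high_loss_indices). The threshold is recomputed
    for every mini-batch, so no predefined noise rate is needed.
    """
    if not losses:
        return [], []
    threshold = sum(losses) / len(losses)  # average loss of this mini-batch
    # Assumption: ties (loss == threshold) are kept on the low-loss side.
    low = [i for i, loss in enumerate(losses) if loss <= threshold]
    high = [i for i, loss in enumerate(losses) if loss > threshold]
    return low, high


if __name__ == "__main__":
    low, high = split_by_mean_loss([0.1, 0.2, 3.0, 0.3])
    print("low-loss:", low, "high-loss:", high)
```

Because the threshold follows the loss distribution of each mini-batch, the fraction of samples kept varies from batch to batch and from epoch to epoch.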
Why can we distinguish reusable samples from irrelevant ones? Intuitively, the predicted label probability of a reusable sample has a completely different pattern from that of an instance which belongs to an irrelevant category. Irrelevant samples can be distinguished based on the confusion of the prediction. For example, as shown in Fig. 3, in a bird classification task, when we feed a bird image (e.g., a hard or mislabeled sample) into the model, the network tends to produce a certain prediction although it may have a high loss due to incorrect labeling. However, if we feed an irrelevant sample (e.g., a bird distribution map), which clearly belongs to none of the categories in this task, the network gets confused and thus produces an uncertain prediction. Inspired by this observation, we formalize a new criterion to further select reusable samples from the high-loss ones:
Definition 3.2 ().
In a mini-batch , a sample is reusable if its prediction certainty satisfies the condition: .
(Chang et al., 2017b) demonstrated that the prediction variance can be used to measure the uncertainty of each sample in classification tasks. Therefore, in order to quantify the certainty of prediction, we simply adopt the standard deviation (the square root of the variance) of the predicted probabilities for sample $x_i$:
$$\mathrm{std}(x_i) = \sqrt{\frac{1}{c}\sum_{j=1}^{c}\left(p_j - \bar{p}\right)^2} \qquad (4)$$
where $c$ is the number of categories, $p_j$ is the softmax result of the network output for $x_i$, which is deemed as the (pseudo) predicted probability of sample $x_i$ belonging to the $j$-th category, and $\bar{p}$ is the mean value of the predicted probabilities. Since $\sum_{j=1}^{c} p_j = 1$ and hence $\bar{p} = 1/c$, we can easily rewrite Eq. (4) as:
$$\mathrm{std}(x_i) = \sqrt{\frac{1}{c}\sum_{j=1}^{c} p_j^2 - \frac{1}{c^2}} \qquad (5)$$
The standard deviation of predicted probabilities is consistent with prediction certainty. From Eq. (5) and the constraint $\sum_{j=1}^{c} p_j = 1$, we can make the following observations. 1) The standard deviation of predicted probabilities is bounded. 2) $\mathrm{std}(x_i)$ gets larger when one label’s probability (e.g., $p_k$) gets notably higher. 3) If $p_k$ is significantly higher than the other values, the network is more certain about its prediction on this sample. $\mathrm{std}(x_i)$ reaches its maximum value $\sqrt{c-1}/c$ when
$$p_k = 1, \quad p_j = 0 \ \text{for all } j \neq k.$$
In this case, the prediction certainty also reaches its maximum. On the other hand, $\mathrm{std}(x_i)$ gets smaller when all labels’ probabilities get closer to each other. In this case, the prediction certainty also drops. This observation is consistent with intuition: if the network produces a prediction in which each label’s probability is close to the others, the network is highly confused about this sample. $\mathrm{std}(x_i)$ reaches its minimum value 0 when
$$p_1 = p_2 = \dots = p_c = \frac{1}{c}.$$
In this situation, the model fails as all labels are predicted equally.
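The certainty measure discussed above can be checked numerically. The sketch below assumes the population form of the standard deviation (dividing by the number of categories) and uses hypothetical function names. It computes the certainty both from the definition and from the simplified form that follows when the probabilities sum to one, then evaluates the two extreme cases: a uniform prediction and a one-hot prediction.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def certainty_from_definition(probs: List[float]) -> float:
    """Standard deviation of the predicted probabilities (divide by c)."""
    c = len(probs)
    mean = sum(probs) / c
    return math.sqrt(sum((p - mean) ** 2 for p in probs) / c)


def certainty_simplified(probs: List[float]) -> float:
    """Same quantity, using sum(probs) == 1 so that the mean equals 1/c."""
    c = len(probs)
    value = sum(p * p for p in probs) / c - 1.0 / (c * c)
    return math.sqrt(max(value, 0.0))  # clamp tiny negative rounding errors


if __name__ == "__main__":
    c = 4
    uniform = [1.0 / c] * c           # most confused prediction
    one_hot = [1.0, 0.0, 0.0, 0.0]    # most certain prediction
    print(certainty_simplified(uniform))                     # minimum: 0
    print(certainty_simplified(one_hot), math.sqrt(c - 1) / c)  # maximum
    print(certainty_simplified(softmax([2.0, 0.5, 0.1, -1.0])))
```

The one-hot case attains `sqrt(c - 1) / c`, which shrinks as the number of categories grows, so any threshold on this quantity has to be interpreted relative to the number of classes.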
3.3. Label Correction
The reusable samples selected by the certainty-based criterion defined in Definition 3.2 are assembled into a reusable sample set. To be specific, this set includes two types of images: 1) mislabeled instances and 2) hard examples that have correct labels.
In order to leverage these informative samples for training, their noisy labels have to be corrected. (Yang et al., 2018) proposed to use the current prediction to replace original labels. Nevertheless, because CNN predictions on noisy datasets lack robustness, using a single prediction to relabel mislabeled instances may result in error accumulation. Therefore, different from (Yang et al., 2018), we propose to correct noisy labels using the prediction history. The rationale is that averaging distributions over classifier iterations can increase stability and reduce the influence of misleading predictions. Empirical results show that our label correction method works better.
Specifically, we record the label prediction as well as its corresponding predicted probability for each training sample. A history list is maintained for each sample, in which each entry is a pair consisting of the label predicted for the sample in a given epoch and the corresponding predicted probability. The history list thus memorizes each sample’s predictions from the previous epochs.
When calculating forward losses, we correct samples belonging to the reusable set by replacing their original labels with the corrected ones defined in Definition 3.3. For samples belonging to the low-loss (clean, easy) set, we directly use their original labels. The leftover samples are regarded as irrelevant data and are excluded for robust training.
Definition 3.3.
For a reusable sample, its corrected label is the label prediction that has the highest accumulated probability over the previous epochs recorded in its history list.
Why don’t we need to separate mislabeled and hard examples? As stated above, both mislabeled samples and hard ones are included in the reusable set. It is difficult to split the reusable set reliably into hard samples and mislabeled ones. As a result, the label correction might also relabel hard samples using their previous predictions. However, both hard examples and mislabeled ones tend to produce consistent predictions on their true labels, though their predicted probabilities might be lower than those of easy samples. Therefore, using the prediction history to relabel hard samples would not compromise the model.
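The history-based relabeling can be sketched as a small data structure. This is an illustrative sketch under stated assumptions, not the authors' code: the class name `PredictionHistory`, the use of integer sample identifiers, and the choice to discard the oldest entry once the length limit is reached are all assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple


class PredictionHistory:
    """Keeps, per sample, the last `max_len` (predicted label, probability) pairs."""

    def __init__(self, max_len: int) -> None:
        if max_len < 1:
            raise ValueError("max_len must be at least 1")
        self.max_len = max_len
        self._history: Dict[int, Deque[Tuple[int, float]]] = {}

    def record(self, sample_id: int, label: int, prob: float) -> None:
        """Store the prediction of one epoch; the oldest entry is dropped when full."""
        if sample_id not in self._history:
            self._history[sample_id] = deque(maxlen=self.max_len)
        self._history[sample_id].append((label, prob))

    def corrected_label(self, sample_id: int) -> Optional[int]:
        """Return the label with the highest accumulated probability, or None."""
        accumulated: Dict[int, float] = defaultdict(float)
        for label, prob in self._history.get(sample_id, ()):
            accumulated[label] += prob
        if not accumulated:
            return None
        return max(accumulated, key=lambda k: accumulated[k])


if __name__ == "__main__":
    history = PredictionHistory(max_len=3)
    history.record(7, label=2, prob=0.40)
    history.record(7, label=5, prob=0.45)
    history.record(7, label=2, prob=0.30)
    # Label 2 accumulates 0.70 against 0.45 for label 5.
    print(history.corrected_label(7))
```

In the usage above, the most recent single prediction with the highest probability is label 5, yet the accumulated evidence favors label 2. That is the kind of isolated misleading prediction the history is meant to smooth out.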
3.4. Summary of CRSSC
Here, to further enhance the generalization performance of CRSSC, we adopt the Label Smoothing Regularization (LSR) (Szegedy et al., 2016) when calculating the cross-entropy loss. That is to say, for each input image, the one-hot ground-truth distribution is replaced by a smoothed ground-truth probability distribution in the loss calculation.
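As a point of reference, label smoothing can be sketched as follows. The exact smoothed distribution used in this paper is not recoverable from the text here, so the sketch assumes the uniform-prior form of Szegedy et al. (2016), in which a fraction `epsilon` of the probability mass is spread evenly over all classes. The function names and the symbol `epsilon` are assumptions.

```python
import math
from typing import List


def smooth_labels(true_label: int, num_classes: int, epsilon: float) -> List[float]:
    """Smoothed target: (1 - epsilon) * one_hot + epsilon / num_classes (assumed form)."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    target = [epsilon / num_classes] * num_classes
    target[true_label] += 1.0 - epsilon
    return target


def cross_entropy(target: List[float], probs: List[float]) -> float:
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


if __name__ == "__main__":
    target = smooth_labels(true_label=1, num_classes=4, epsilon=0.2)
    print(target)
    print(cross_entropy(target, [0.1, 0.7, 0.1, 0.1]))
```

With `epsilon = 0` the target reduces to the usual one-hot label, which corresponds to training without LSR.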
For CRSSC, we first train the network on the whole training set in a conventional manner to warm up the network. The reason is that deep CNNs have memorization effects (Zhang et al., 2017a; Arpit et al., 2017) and will learn clean and easy patterns in the initial epochs. With this warm-up step, the CNN acquires an initial learning capacity. Then, we perform a two-step sample selection in each mini-batch: 1) select low-loss instances from the mini-batch using the criterion defined in Definition 3.1, and 2) identify reusable samples, which have high prediction certainty, among the high-loss instances based on Definition 3.2. Subsequently, reusable samples are relabeled based on Definition 3.3. Finally, the parameters of the network are updated using the clean, easy sample set along with the reusable sample set. The detailed process of our proposed CRSSC is shown in Algorithm 1.
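The two-step selection within one mini-batch can be sketched as below. This is only an outline under stated assumptions: per-sample losses and prediction certainties are taken as precomputed inputs, the function name `select_samples` is hypothetical, and `certainty_threshold` is a caller-supplied stand-in for the condition in Definition 3.2, whose exact form is not reproduced here.

```python
from typing import Dict, List


def select_samples(
    losses: List[float],
    certainties: List[float],
    certainty_threshold: float,
) -> Dict[str, List[int]]:
    """Partition one mini-batch into clean, reusable, and dropped sample indices.

    Step 1: samples whose loss is at most the mini-batch mean are kept as clean.
    Step 2: among the remaining high-loss samples, those with prediction
            certainty at or above `certainty_threshold` (an assumed, caller-chosen
            value) are reusable; the rest are dropped as irrelevant.
    """
    if len(losses) != len(certainties):
        raise ValueError("losses and certainties must have the same length")
    groups: Dict[str, List[int]] = {"clean": [], "reusable": [], "dropped": []}
    if not losses:
        return groups
    mean_loss = sum(losses) / len(losses)
    for i, (loss, certainty) in enumerate(zip(losses, certainties)):
        if loss <= mean_loss:
            groups["clean"].append(i)
        elif certainty >= certainty_threshold:
            groups["reusable"].append(i)
        else:
            groups["dropped"].append(i)
    return groups


if __name__ == "__main__":
    result = select_samples(
        losses=[0.1, 2.0, 2.5, 0.2],
        certainties=[0.30, 0.28, 0.05, 0.30],
        certainty_threshold=0.2,
    )
    print(result)
```

Clean samples would keep their original labels, reusable samples would be relabeled from their prediction history before the loss is computed, and dropped samples would contribute nothing to the parameter update.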
|Supervision||Method||Box/part annotations||Training data||FGVC-Aircraft (Maji et al., 2013)||CUB200-2011 (Wah et al., 2011)||Stanford Cars (Krause et al., 2013)|
|Strongly||Part-Stacked (Huang et al., 2016)||✓||anno.||-||76.6||-|
|Strongly||Coarse-to-fine (Yao et al., 2016)||✓||anno.||87.7||82.9||-|
|Strongly||HSnet (Lam et al., 2017)||✓||anno.||-||87.5||93.9|
|Strongly||Mask-CNN (Wei et al., 2018)||✓||anno.||-||85.7||-|
|Weakly||iSQRT-COV (Li et al., 2018)|| ||anno.||91.4||88.7||93.3|
|Weakly||Parts Model (Ge et al., 2019)|| ||anno.||-||90.4||-|
|Weakly||TASN (Zheng et al., 2019)|| ||anno.||-||89.1||93.8|
|Weakly||DCL (Chen et al., 2019)|| ||anno.||93.0||87.8||94.5|
|Semi||Cui et al. (Cui et al., 2016)|| ||anno.+web||-||89.7||-|
|Semi||Xu et al. (Xu et al., 2016)|| ||anno.+web||-||84.6||-|
|Semi||Niu et al. (Niu et al., 2018)|| ||anno.+web||-||76.5||-|
|Semi||Cui et al. (Cui et al., 2018)|| ||anno.+iNat||90.7||89.3||93.5|
|Webly||VGG-16 (Simonyan and Zisserman, 2014)|| ||web||68.4||66.3||61.6|
|Webly||ResNet-50 (He et al., 2016)|| ||web||60.4||64.4||60.6|
|Webly||B-CNN (Lin et al., 2017b)|| ||web||64.3||66.6||67.4|
|Webly||Decoupling (Malach and Shalev-Shwartz, 2017)|| ||web||75.9||70.6||75.0|
|Webly||Co-teaching (Han et al., 2018)|| ||web||72.8||73.9||73.1|
4.1. Datasets and Evaluation Metrics
4.2. Implementation Details
|+ Def. 3.1||69.7|
|+ Def. 3.1 + Def. 3.2||69.9|
|+ Def. 3.1 + Def. 3.2 + Def. 3.3||73.4|
|+ Def. 3.1 + Def. 3.2 + Def. 3.3 + LSR (CRSSC)||75.6|
|+ Fine-tuned CRSSC||76.8|
Data preparation: To collect the web training set, we follow (Niu et al., 2018) and retrieve images from an image search engine using the category labels in benchmark datasets.
For ensuring no overlaps between the training and testing set, we additionally perform a PCA near-duplicate removal (Zhou et al., 2016) between collected web images and test images in the benchmark datasets. Finally, we regard filtered web images (13503 for FGVC-Aircraft, 18388 for CUB200-2011, and 21448 for Stanford Cars) as the training set and adopt testing images from original benchmark datasets.
CRSSC learning: We use a pre-trained model (e.g., VGG-16 (Simonyan and Zisserman, 2014)) as the basic CNN network. The number of warm-up epochs is tested in . The history list length limit is selected from . The LSR parameter is chosen from . During the network optimization, we adopt a SGD optimizer with momentum . The learning rate, batch size, and weight decay are set to be 0.01, 32, and 0.0003, respectively.
4.3. Baseline Methods
Strongly supervised methods require bounding boxes or part annotations during training. This set of baselines includes Part-Stacked (Huang et al., 2016), Coarse-to-fine (Yao et al., 2016), HSnet (Lam et al., 2017), and Mask-CNN (Wei et al., 2018). Weakly supervised methods require image-level labels, including Parts Model (Ge et al., 2019), iSQRT-COV (Li et al., 2018), TASN (Zheng et al., 2019), and DCL (Chen et al., 2019). Semi-supervised methods leverage web images but still involve human intervention, including Cui et al. (Cui et al., 2016), Xu et al. (Xu et al., 2016), Niu et al. (Niu et al., 2018), and Cui et al. (Cui et al., 2018). For strongly, weakly, and semi-supervised methods, we report the performances in their papers.
Webly supervised methods directly leverage web images without human involvement, including VGG-16 (Simonyan and Zisserman, 2014), ResNet-50 (He et al., 2016), B-CNN (Lin et al., 2017b), Decoupling (Malach and Shalev-Shwartz, 2017) and Co-teaching (Han et al., 2018). To be fair, we use the same backbone B-CNN (Lin et al., 2017b) in Decoupling, Co-teaching and our CRSSC. For basic networks VGG-16, ResNet-50, and B-CNN, we fine-tune them with noisy web images.
4.4. Experimental Results
Table 1 presents the comparison of ACA results on three benchmark datasets. It should be noted that the results of webly methods are all produced from experiments using exactly the same training data. By observing Table 1, we can notice that our approach performs better than other webly supervised methods on all three benchmark datasets. Compared with basic networks VGG-16, ResNet-50, and B-CNN, our CRSSC (with backbone B-CNN) can effectively alleviate the influence of label noise in the process of model training. Compared with state-of-the-art webly supervised methods Decoupling and Co-teaching, our approach can additionally identify reusable samples and salvage them by performing a label correction. Thus, our CRSSC can efficiently explore more useful samples to boost the robustness of the FGVC model.
5. Ablation Studies
5.1. Training Loss and Prediction Certainty
The prediction loss and certainty are two fundamental criteria for selecting informative samples. To investigate the distribution of prediction loss and certainty for clean, reusable, and dropped samples in training process, we select 30 instances in total, 10 images for each group (clean, reused, and dropped) and plot their prediction losses as well as prediction certainties. The experimental results are shown in Fig. 5.
From Fig. 5, we can find that as training proceeds, the losses of clean samples decrease sharply while their prediction certainties increase steadily. Regarding reusable samples, although some of them have a considerably higher loss than clean ones, their prediction certainties increase remarkably as training progresses. The explanation is that reusable samples are either hard or mislabeled instances and thus tend to produce confident predictions consistently. The high loss and low prediction certainty of dropped samples indicate that our CRSSC can successfully identify and drop these irrelevant samples.
|Dataset||Net 1||ACA 1||Net 2||ACA 2|
|Backbone||FGVC Aircraft||CUB200||Stanford Cars|
5.2. Overlap of Identified Samples in Epochs
To further investigate the robustness of our sample selection strategy, we explore the overlap ratio of selected samples between adjacent epochs. To this end, we record the sample selection overlap between each epoch and its previous 1, 2, 3, 5, 8, 10 epochs. Fig. 5 presents the overlap between the dropped (clean) sample set selected in the current epoch and the dropped (clean) sample sets selected in these previous epochs. From Fig. 5, we can observe that the overlap ratios of both dropped and clean samples grow steadily and finally converge to a high level, which supports the consistency and robustness of our sample selection strategy.
5.3. Influence of Different Backbones
It is well known that the choice of CNN architectures has a critical impact on object recognition performance. To investigate the influence of different backbones, we conduct experiments by using different basic networks VGG-16 (Simonyan and Zisserman, 2014), ResNet-18, and ResNet-50 (He et al., 2016). The experimental results are shown in Table 3.
From Table 3, we can have the following observations: 1) with a much deeper backbone network like ResNet-50, our CRSSC can yield significantly better performance than ResNet-18 and VGG-16. 2) When training a basic network directly with noisy web images, the basic network with higher capacity may produce a worse result. However, by adopting our CRSSC, we can make full use of the learning capacity of basic networks via properly selecting reusable samples and correcting their labels. Compared with the standard network, the improvement of performance demonstrates the superiority of our proposed approach.
5.4. Influence of Different Steps
In this subsection, we investigate the influence of various steps on a basic network like ResNet-18. We first add the Def. 3.1 on the ResNet-18 network to construct a baseline. We then add the Def. 3.1 and 3.2 to construct another baseline. For the third baseline, we add all the Def. 3.1, 3.2, and 3.3 to the ResNet-18. For the fourth baseline, we add the label smoothing technique to complete our CRSSC method. Finally, we present a fine-tuned CRSSC model as the last baseline. The experimental results on the CUB200-2011 dataset are summarized in Table 3. By observing Table 3, we can find that the fine-tuned CRSSC framework obtains the best performance.
5.5. Combining with Co-teaching
Our proposed method can be flexibly combined with other techniques since it only involves sample selection and relabeling. Here, we combine our CRSSC with a state-of-the-art method Co-teaching (Han et al., 2018) for further performance improvement. Following the setup in Co-teaching, we maintain two networks simultaneously. In each mini-batch, each network constructs its own clean sample set and reusable sample set and subsequently feeds them into its peer network for further updating. Different from Co-teaching, which only trains the model with clean, easy samples, the combined method additionally leverages hard and mislabeled samples to promote the network optimization. Table 5 reports the ACA results of combining CRSSC with Co-teaching, using the same and different backbones, on the CUB200-2011 dataset. Compared with the naive CRSSC (presented in Table 3), we can observe that a clear improvement has been achieved in Table 5.
5.6. Combining Web and Labeled Data
One of the roadblocks that limit the performance of fine-grained visual classification is the lack of enough labeled training data. The widely-used FGVC benchmark datasets (e.g., FGVC-Aircraft, CUB200-2011, and Stanford Cars) all suffer from limited training data, which severely prevented the FGVC task from being sufficiently benefited from the high learning capability of deep CNN. Therefore, employing web images as a supplement to existing fine-grained datasets also attracts considerable attention in recent years. Following the semi-supervised manner, we leverage collected web images as data augmentation to the labeled training data for training deep FGVC model. The experimental results of our CRSSC training on the combined data are shown in Table 5.
5.7. Trend of Samples in Training
We present the ratio variations of identified clean, reusable, and dropped samples during the training processes in Fig. 6 (left). From Fig. 6 (left), we can notice that the ratio of clean samples increases steadily until convergence as training progresses. As the training continues, the previous hard examples will gradually become “easy” for our model. Meanwhile, as the network training proceeds, the certainty-based criterion gradually reduces mistaken dropping, thus leading to a steady decrease in the ratio of dropped samples until it converges to the ground-truth noise rate. Additionally, since hard examples become fewer as the network capability grows, the ratio of reusable samples also decreases, until only mislabeled examples are left. The final convergence of the three groups demonstrates the stability and robustness of our sample selection strategy. Fig. 6 (right) shows the drop rate among sorted mini-batches in one epoch. From Fig. 6 (right), we can see that the ratio of dropped samples varies considerably across mini-batches, which supports avoiding a predefined drop rate.
5.8. Parameters in Proposed Approach
For the parameter analysis, we consider three parameters: 1) the number of warm-up epochs, 2) the length of the history list, and 3) the LSR smoothing level. Fig. 7 gives the results on the CUB200-2011 dataset.
From Fig. 7 (left), we can observe that CRSSC is relatively stable when varying by fixing other two parameters. Both cases achieve the best performance when is selected in and we select as the default option. The length of history list affects the precision of label correction. Intuitively, a higher value of may benefit the label correction. However, the relabeling could also be misled due to poor predictions of early epochs. From Fig. 7 (middle), we notice that the best performance can be obtained when , we select as the default option. It should be noted that when = 0, this means that we use the prediction label of the current epoch instead of that with the highest accumulated probability in the previous epochs.
The smoothing level has an influence on the generalization ability. Compared with the case in which LSR is not leveraged (i.e. ), from Fig.7 (right), we can find that the classification accuracy increases considerably when adopting a proper level of label smoothing. When , the performance is fairly robust. However, as the gets larger than 0.6, the performance starts to decrease. This probably results from the fact that too large leads to a lack of proper ground-truth guidance in the training process.
5.9. Further Studies on Noisy-CIFAR100
To further explore the effectiveness of our approach, we follow Co-teaching (Han et al., 2018) and generate a synthetic dataset based on CIFAR100 (Krizhevsky and Hinton, 2009) for further study. We first regard the last 20 categories of CIFAR100 as the irrelevant categories. Then we randomly select a proportion of the remaining training samples and corrupt their labels to simulate mislabeled data. We name this synthetic dataset Noisy-CIFAR100.
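A synthetic label set of this kind could be generated along the following lines. The sketch makes several assumptions that the text does not specify: irrelevant samples receive a uniformly random label among the remaining categories, mislabeled samples receive a uniformly random different category (symmetric noise), and the corruption proportion is a caller-supplied argument `mislabel_ratio`. The function name and the returned sample types are hypothetical.

```python
import random
from typing import List, Tuple


def make_noisy_labels(
    labels: List[int],
    mislabel_ratio: float,
    num_classes: int = 100,
    num_irrelevant: int = 20,
    seed: int = 0,
) -> List[Tuple[int, str]]:
    """Return (noisy_label, kind) per sample, kind in {clean, mislabeled, irrelevant}."""
    rng = random.Random(seed)
    num_kept = num_classes - num_irrelevant  # categories that remain in the task
    noisy: List[Tuple[int, str]] = []
    relevant_indices: List[int] = []
    for i, y in enumerate(labels):
        if y >= num_kept:
            # Assumption: images of the last categories get a random in-task label.
            noisy.append((rng.randrange(num_kept), "irrelevant"))
        else:
            noisy.append((y, "clean"))
            relevant_indices.append(i)
    num_flip = int(round(mislabel_ratio * len(relevant_indices)))
    for i in rng.sample(relevant_indices, num_flip):
        y = labels[i]
        new_label = rng.randrange(num_kept - 1)
        if new_label >= y:
            new_label += 1  # skip the true label so the new one always differs
        noisy[i] = (new_label, "mislabeled")
    return noisy


if __name__ == "__main__":
    toy_labels = [i % 100 for i in range(1000)]  # stand-in for CIFAR100 labels
    noisy = make_noisy_labels(toy_labels, mislabel_ratio=0.2)
    kinds = [kind for _, kind in noisy]
    print({k: kinds.count(k) for k in ("clean", "mislabeled", "irrelevant")})
```

Keeping the sample type alongside each noisy label makes it possible to measure selection and relabeling accuracy later, since the ground truth for every sample is known by construction.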
Prediction Accuracy for Different Samples: As the memorization effects (Arpit et al., 2017; Zhang et al., 2017a) indicate, a CNN tends to fit clean samples in initial epochs and will eventually fit noisy data (i.e., irrelevant samples and mislabeled ones). Fig. 8 presents the prediction accuracy of the baseline model ResNet-18 (left) and our CRSSC model (right). From Fig. 8 (left), we can observe that, while the prediction accuracy on clean training samples grows steadily in the training process, the CNN eventually fits the noisy training data, which degrades the classification ability of the final model. From Fig. 8 (right), we can notice that, by using our CRSSC, the over-fitting to noisy data is effectively suppressed. Besides, by using our relabeling strategy, mislabeled samples can be better learned and finally contribute to boosting the model classification ability.
Sample Selection and Relabeling Accuracy: Fig. 9 (left) shows the sample selection accuracy of our approach and Fig. 9 (right) presents the sample relabeling accuracy. From Fig. 9 (left), we can find that the sample selection accuracy (including both reused and dropped samples) grows steadily as the training proceeds. In addition, the sample relabeling accuracy also increases steadily during training.
In this work, we studied the problem of training FGVC models directly with noisy web images. Accordingly, we proposed a simple yet effective approach, termed CRSSC, which trains a deep neural network using additionally selected hard and mislabeled samples to boost the robustness of the model. Comprehensive experiments showed that our approach achieves state-of-the-art performance compared with existing webly supervised methods.
This work was supported by the National Natural Science Foundation of China (No. 61976116) and the Fundamental Research Funds for the Central Universities (No. 30920021135).
- Arpit et al. (2017) Devansh Arpit et al. 2017. A closer look at memorization in deep networks. In ICML. 233–242.
- Branson et al. (2014) Steve Branson et al. 2014. Bird species categorization using pose normalized deep convolutional nets. In BMVC.
- Chang et al. (2017a) Haw-Shiuan Chang et al. 2017a. Active bias: Training more accurate neural networks by emphasizing high variance samples. In NIPS. 1002–1012.
- Chang et al. (2017b) Haw-Shiuan Chang et al. 2017b. Active bias: Training more accurate neural networks by emphasizing high variance samples. In NIPS. 1002–1012.
- Chen et al. (2020) Tao Chen et al. 2020. Classification Constrained Discriminator For Domain Adaptive Semantic Segmentation. In ICME. 1–6.
- Chen et al. (2019) Yue Chen et al. 2019. Destruction and construction learning for fine-grained image recognition. In CVPR. 5157–5166.
- Cui et al. (2018) Yin Cui et al. 2018. Large scale fine-grained categorization and domain-specific transfer learning. In CVPR. 4109–4118.
- Cui et al. (2016) Yin Cui et al. 2016. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In CVPR. 1153–1162.
- Cui et al. (2017) Yin Cui et al. 2017. Kernel pooling for convolutional neural networks. In CVPR. 2921–2930.
- Dubey et al. (2018) Abhimanyu Dubey et al. 2018. Maximum-entropy fine grained classification. In NIPS. 637–647.
- Gao et al. (2016) Yang Gao et al. 2016. Compact bilinear pooling. In CVPR. 317–326.
- Ge et al. (2019) Weifeng Ge et al. 2019. Weakly supervised complementary parts models for fine-grained image classification from the bottom up. In CVPR. 3034–3043.
- Goldberger and Ben-Reuven (2017) Jacob Goldberger et al. 2017. Training deep neural-networks using a noise adaptation layer. In ICLR.
- Han et al. (2018) Bo Han et al. 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NIPS. 8527–8537.
- He et al. (2016) Kaiming He et al. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
- Hua et al. (2016) Xian-sheng Hua et al. 2016. A domain robust approach for image dataset construction. In ACM MM. 212–216.
- Huang et al. (2016) Shaoli Huang et al. 2016. Part-stacked cnn for fine-grained visual categorization. In CVPR. 1173–1182.
- Jiang et al. (2017) Lu Jiang et al. 2017. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML.
- Kong and Fowlkes (2017) Shu Kong et al. 2017. Low-rank bilinear pooling for fine-grained classification. In CVPR. 365–374.
- Krause et al. (2013) Jonathan Krause et al. 2013. 3d object representations for fine-grained categorization. In CVPRW. 554–561.
- Krizhevsky and Hinton (2009) Alex Krizhevsky et al. 2009. Learning multiple layers of features from tiny images. Technical report, University of Toronto 1, 4 (2009), 7.
- Kumar et al. (2010) M Pawan Kumar et al. 2010. Self-paced learning for latent variable models. In NIPS. 1189–1197.
- Lam et al. (2017) Michael Lam et al. 2017. Fine-grained recognition as hsnet search for informative image parts. In CVPR. 2520–2529.
- Li et al. (2018) Peihua Li et al. 2018. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In CVPR. 947–955.
- Lin et al. (2017a) Tsung-Yi Lin et al. 2017a. Focal loss for dense object detection. In ICCV. 2980–2988.
- Lin and Maji (2017) Tsung-Yu Lin et al. 2017. Improved bilinear pooling with cnns. In BMVC. 117.1–117.12.
- Lin et al. (2017b) Tsung-Yu Lin et al. 2017b. Bilinear convolutional neural networks for fine-grained visual recognition. TPAMI 40, 6 (2017), 1309–1322.
- Lu et al. (2020) Jiarou Lu et al. 2020. Hsi Road: A Hyper Spectral Image Dataset For Road Segmentation. In ICME. 1–6.
- Luo et al. (2019) Haonan Luo et al. 2019. Segeqa: Video segmentation based visual attention for embodied question answering. In ICCV. 9667–9676.
- Maji et al. (2013) Subhransu Maji et al. 2013. Fine-grained visual classification of aircraft. arXiv:1306.5151 (2013).
- Malach and Shalev-Shwartz (2017) Eran Malach et al. 2017. Decoupling “when to update” from “how to update”. In NIPS. 960–970.
- Niu et al. (2018) Li Niu et al. 2018. Webly supervised learning meets zero-shot learning: A hybrid approach for fine-grained classification. In CVPR. 7171–7180.
- Patrini et al. (2017) Giorgio Patrini et al. 2017. Making deep neural networks robust to label noise: A loss correction approach. In CVPR. 1944–1952.
- Reed et al. (2014) Scott Reed et al. 2014. Training deep neural networks on noisy labels with bootstrapping. In ICLR. 1–11.
- Ren et al. (2018) Mengye Ren et al. 2018. Learning to reweight examples for robust deep learning. In ICML. 4334–4343.
- Shrivastava et al. (2016) Abhinav Shrivastava et al. 2016. Training region-based object detectors with online hard example mining. In CVPR. 761–769.
- Shu et al. (2019) Xiangbo Shu et al. 2019. Hierarchical long short-term concurrent memory for human interaction recognition. TPAMI (2019).
- Simonyan and Zisserman (2014) Karen Simonyan et al. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014).
- Song et al. (2019) Hwanjun Song et al. 2019. SELFIE: Refurbishing unclean samples for robust deep learning. In ICML. 5907–5915.
- Sun et al. (2019) Zeren Sun et al. 2019. Dynamically visual disambiguation of keyword-based image search. IJCAI (2019), 996–1002.
- Szegedy et al. (2016) Christian Szegedy et al. 2016. Rethinking the inception architecture for computer vision. In CVPR. 2818–2826.
- Tang et al. (2017) Jinhui Tang et al. 2017. Personalized age progression with bi-level aging dictionary learning. TPAMI 40, 4 (2017), 905–917.
- Van Horn et al. (2018) Grant Van Horn et al. 2018. The inaturalist species classification and detection dataset. In CVPR. 8769–8778.
- Wah et al. (2011) Catherine Wah et al. 2011. The Caltech-UCSD Birds-200-2011 Dataset. CNS-TR-2011-001 (2011).
- Wei et al. (2018) Xiu-Shen Wei et al. 2018. Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization. PR 76 (2018), 704–714.
- Xie et al. (2019) Guo-Sen Xie et al. 2019. Attentive region embedding network for zero-shot learning. In CVPR. 9384–9393.
- Xie et al. (2020) Guo-Sen Xie et al. 2020. Region Graph Embedding Network for Zero-Shot Learning. In ECCV.
- Xu et al. (2016) Zhe Xu et al. 2016. Webly-supervised fine-grained visual categorization via deep domain adaptation. TPAMI 40, 5 (2016), 1100–1113.
- Yang et al. (2018) Jufeng Yang et al. 2018. Recognition from web data: A progressive filtering approach. TIP 27, 11 (2018), 5303–5315.
- Yao et al. (2016) Hantao Yao et al. 2016. Coarse-to-fine description for fine-grained visual categorization. TIP 25, 10 (2016), 4858–4872.
- Yao et al. (2020) Yazhou Yao et al. 2020. Exploiting web images for multi-output classification: From category to subcategories. TNNLS 31, 7 (2020), 2348–2360.
- Yao et al. (2018a) Yazhou Yao et al. 2019a. Extracting multiple visual senses for web learning. TMM 21, 1 (2019), 184–196.
- Yao et al. (2018b) Yazhou Yao et al. 2019b. Extracting privileged information for enhancing classifier learning. TIP 28, 1 (2019), 436–450.
- Yao et al. (2017) Yazhou Yao et al. 2017. Exploiting web images for dataset construction: A domain robust approach. TMM 19, 8 (2017), 1771–1784.
- Yao et al. (2020) Yazhou Yao et al. 2019. Towards automatic construction of diverse, high-quality image datasets. TKDE 32, 6 (2020), 1199–1211.
- Yi and Wu (2019) Kun Yi et al. 2019. Probabilistic end-to-end noise correction for learning with noisy labels. In CVPR. 7017–7025.
- Zhang et al. (2017a) Chiyuan Zhang et al. 2017a. Understanding deep learning requires rethinking generalization. In ICLR.
- Zhang et al. (2020a) Chuanyi Zhang et al. 2020a. Web-Supervised Network with Softly Update-Drop Training for Fine-Grained Visual Classification. In AAAI. 12781–12788.
- Zhang et al. (2020b) Chuanyi Zhang et al. 2020b. Web-Supervised Network for Fine-Grained Visual Classification. In ICME. 1–6.
- Zhang et al. (2016) Fumin Shen et al. 2016. Automatic image dataset construction with multiple textual metadata. In ICME. 1–6.
- Zhang et al. (2017b) Jian Zhang et al. 2017b. A new web-supervised method for image dataset constructions. Neurocomputing 236 (2017), 23–31.
- Zhang et al. (2018) Jian Zhang et al. 2018. Discovering and distinguishing multiple visual senses for polysemous words. In AAAI. 523–530.
- Zheng et al. (2019) Heliang Zheng et al. 2019. Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition. In CVPR. 5012–5021.
- Zhou et al. (2016) Bolei Zhou et al. 2016. Places: An image database for deep scene understanding. arXiv:1610.02055 (2016).