Data-driven Meta-set Based Fine-Grained Visual Classification

08/06/2020 ∙ by Chuanyi Zhang, et al. ∙ Nanjing University ∙ The University of Adelaide

Constructing fine-grained image datasets typically requires domain-specific expert knowledge, which is not always available to crowd-sourcing platform annotators. Accordingly, learning directly from web images becomes an alternative for fine-grained visual recognition. However, label noise in a web training set can severely degrade model performance. To this end, we propose a data-driven meta-set based approach to deal with noisy web images for fine-grained recognition. Specifically, guided by a small clean meta set, we train a selection net in a meta-learning manner to distinguish in- and out-of-distribution noisy images. To further boost the robustness of the model, we also learn a labeling net to correct the labels of in-distribution noisy data. In this way, our proposed method can alleviate the harmful effects caused by out-of-distribution noise and properly exploit in-distribution noisy samples for training. Extensive experiments on three commonly used fine-grained datasets demonstrate that our approach is substantially superior to state-of-the-art noise-robust methods.


1. Introduction

Deep Neural Networks (DNNs) have achieved impressive results on many computer vision tasks due to the availability of large-scale image datasets (Yao et al., 2018b, a; Xie et al., 2019; Luo et al., 2019; Xie et al., 2020; Zhang et al., 2018a; Lu et al., 2020; Chen et al., 2020; Tang et al., 2017; Shu et al., 2019b). However, fine-grained visual classification (FGVC) remains challenging. Training DNNs for FGVC tends to require expert-annotated labels, possibly with additional annotations in the form of parts, attributes, or relationships. The high cost of manual annotation limits the scale of FGVC datasets and constrains model performance and scalability. To reduce the cost of manual labeling, a growing number of works have focused on the semi-supervised paradigm (Xu et al., 2015; Niu et al., 2018; Cui et al., 2016) and utilized web images for data augmentation.

Web images have distinct advantages over manually labeled ones: they are rich and free. For arbitrary categories, potential training data can be easily obtained from publicly available sources such as Google Images. Leveraging web data therefore makes it possible to build large-scale datasets at nearly no manual cost (Yao et al., 2017, 2019; Hua et al., 2016; Zhang et al., 2017, 2016b). Unfortunately, web data inevitably contains label noise, which is a huge obstacle to training robust deep FGVC models. The label noise in web data for fine-grained recognition can be roughly divided into two types: in-distribution and out-of-distribution. Specifically, in-distribution noisy images have their true labels inside the label set of the dataset, while the true labels of out-of-distribution noisy samples fall outside of it. Since DNNs have a high capacity to fit noisy data (Arpit et al., 2017; Zhang et al., 2016a), directly utilizing noisy web images to train fine-grained recognition models usually results in poor performance.

A simple yet effective approach to dealing with label noise is to perform sample selection that separates clean instances from noise. Representative works are Decoupling (Malach and Shalev-Shwartz, 2017) and Co-teaching (Han et al., 2018). These works drop samples with a high probability of being incorrectly labeled to reduce the harmful influence of noise. Nevertheless, they cannot exploit the in-distribution noisy instances for representation learning and risk discarding some clean images. To make full use of the noisy dataset, some works concentrated on loss correction, which revises the corrupted labels. For example, (Goldberger and Ben-Reuven, 2016) added an additional softmax layer to estimate the label noise transition matrix. Unfortunately, since label noise in practice is diverse and non-stationary, exact recovery of the noise transition matrix is difficult and remains a challenging problem. (Song et al., 2019) selectively refurbished unclean samples that have consistent label predictions. However, the network risks producing wrong labels. Moreover, these loss correction methods are unable to tackle out-of-distribution noisy images, whose true labels lie outside the set of training labels.

In this work, we propose a hybrid approach that leverages the advantages of both "sample selection" and "loss correction" to learn from noisy web images for the fine-grained task. We assume that the model can access a small set of clean meta images during training. Our key idea is to discard out-of-distribution noise and to relabel in-distribution noisy samples with the help of the small clean meta data. Specifically, we train a selection net to learn the similarities between noisy web images and the clean meta set. It produces the probability that a sample is in-distribution, and we drop images with low in-distribution probabilities. Simultaneously, we train a labeling net to generate pseudo labels for the remaining in-distribution noisy images. In this way, our proposed approach can learn from noisy web images with the guidance of clean meta data. It properly utilizes in-distribution noisy images for training and alleviates the harmful effects caused by out-of-distribution noise.

In summary, this paper makes the following three-fold contributions: (1) We propose a data-driven meta-set based approach that combines "sample selection" and "loss correction" while overcoming their drawbacks. Compared with sample selection approaches, our approach can relabel and exploit in-distribution noisy samples to boost training. Compared with loss correction methods, our method is less likely to suffer from correction error and can deal with out-of-distribution noisy samples. (2) We explain the effectiveness of our proposed approach from a mathematical perspective, which gives the method better interpretability. (3) Extensive experiments and ablation studies demonstrate that our approach outperforms state-of-the-art methods.

2. Related Works

Fine-grained Visual Classification: The task of fine-grained visual classification is to distinguish objects at the subordinate level. Since differences between subcategories tend to lie in discriminative parts, early works trained the network to learn discriminative features by utilizing strong annotations such as bounding boxes or part annotations (Huang et al., 2016; Yao et al., 2016; Lam et al., 2017; Wei et al., 2018; Zhang et al., 2014; Xie et al., 2015). Despite their satisfactory performance, these strongly supervised methods require heavy human annotation. To avoid heavy manual labeling, a number of weakly supervised methods have been proposed, which only need image-level labels (Fu et al., 2017; Lin et al., 2015; Zheng et al., 2017; Wang et al., 2018; Ge et al., 2019; Zheng et al., 2019; Chen et al., 2019; Korsch et al., 2019; Zhang et al., 2016c; He and Peng, 2017; Peng et al., 2017; Zhang et al., 2016d; Branson et al., 2014). However, label annotation still requires expert knowledge. This drawback limits the dataset scale and therefore constrains model performance and scalability. To further improve performance, some semi-supervised methods manage to leverage easily accessible web data (Niu et al., 2018; Cui et al., 2016; Xu et al., 2016; Niu et al., 2015; Xiao et al., 2015; Krause et al., 2016; Van Horn et al., 2015) for FGVC. However, these methods mainly utilize web images as data augmentation and still require a large number of well-labeled images. Different from them, our approach merely requires a small set of clean images to guide noise identification and label correction, and the model is mainly trained on web images.

Learning from Web Images: Training fine-grained recognition models with web images usually results in poor performance due to the presence of label noise and data bias. Numerous studies have addressed the problem of learning from noisy web images (Yao et al., 2020; Zhang et al., 2018b; Sun et al., 2019; Zhang et al., 2020a). In this paper, we categorize them into two groups: sample selection based methods and loss correction based methods. Sample selection methods tackle label noise in a straightforward way by discarding the noisy samples. Representative sample selection methods include Decoupling (Malach and Shalev-Shwartz, 2017) and Co-teaching (Han et al., 2018), which both train two peer networks simultaneously. Specifically, Decoupling chooses samples on which the two networks make different predictions as useful ones, while Co-teaching lets each network select small-loss samples as clean ones for its peer network. These methods have achieved remarkable performance on noisy data by ignoring all unclean samples. However, they fail to utilize potentially useful in-distribution noisy images and risk eliminating clean samples. Loss correction approaches aim to correct the misguidance caused by label noise. The correction methods are diverse, including assigning a weight to the current prediction (Reed et al., 2014), estimating the label noise transition matrix (Goldberger and Ben-Reuven, 2016), and relabeling samples with the network prediction (Song et al., 2019). Although these methods have shown significant performance on manually corrupted datasets, they are unable to tackle out-of-distribution samples in noisy web data. Our method combines sample selection and loss correction, retaining their advantages while overcoming their drawbacks, so it can robustly learn from noisy web images in real-world datasets.

Data-driven Approaches: Considering that learning merely from noisy web images is challenging, some works assume that during training the model has access to a small set of clean labels (Li et al., 2020, 2018). For example, (Hendrycks et al., 2018) leveraged a small set of clean data to estimate the noise transition matrix. (Li et al., 2017) distilled the knowledge learned from the clean dataset to facilitate learning a better model from the entire noisy dataset. (Ren et al., 2018) and (Shu et al., 2019a) utilized a meta-learning algorithm to reweight training examples. Specifically, (Ren et al., 2018) dynamically learned weights based on samples' gradient directions, while (Shu et al., 2019a) learned an explicit weighting function. With access to a small set of clean samples, these data-driven methods achieve remarkable performance and robustness. Inspired by them, our approach leverages a small number of clean images to guide sample selection and label correction.

Figure 1. The architecture of our proposed data-driven meta-learning based approach (a), training mechanism of the selection net (b), and training mechanism of the labeling net (c). For each mini-batch from the noisy web training set, we first split it by training loss $\mathcal{L}$. Small-loss images are regarded as clean data and directly utilized for updating the networks. Samples with large $\mathcal{L}$ are entered into the selection net $S$ to compute their probabilities $p$ of being in-distribution. Then samples with small $p$ are identified as out-of-distribution noisy images and discarded. Those with large $p$ are regarded as in-distribution noisy images. They are relabeled by the labeling net $L$ and leveraged for training with their pseudo labels $\tilde{y}$. Our $S$ and $L$ both take the image features as input and are updated using the meta set.

3. The Proposed Approach

3.1. Overview

The architecture of our proposed framework is presented in Fig. 1 (a). Let $(x, y)$ be the pair of a sample $x$ and its label $y$, and let $D = \{(x_i, y_i)\}_{i=1}^{N}$ be the noisy web training set. We assume that there is a small clean and unbiased meta set $D^{m} = \{(x_j^{m}, y_j^{m})\}_{j=1}^{M}$, where $M \ll N$. In addition, we also assume that the training set contains the meta set. Our goal is to train the classifier network $f(\cdot; w)$ with parameter $w$ on the noisy web training set.

Since deep neural networks have the ability to filter out noisy instances using their loss values at the early training stage (Arpit et al., 2017; Zhang et al., 2016a), we simply utilize the entire training set to train the network in the initial epochs. When a mini-batch $D_b = \{(x_i, y_i)\}_{i=1}^{n}$ is formed from $D$, $w$ is updated through:

$$w^{(t+1)} = w^{(t)} - \alpha \, \nabla_w \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\big(f(x_i; w), y_i\big) \Big|_{w^{(t)}}, \quad t < T_0, \qquad (1)$$

where $\mathcal{L}$, $\alpha$, $t$, and $T_0$ denote the loss function, learning rate, epoch number, and initial epoch number, respectively.

After the initial epochs, we start to adopt our proposed algorithm to tackle noise. We first adopt the widely used loss-based separation method (Malach and Shalev-Shwartz, 2017) that selects a fraction $1 - \gamma$ of low-loss instances as clean samples. The mini-batch $D_b$ is divided into a clean set $D_c$ and a noisy set $D_n$, which can be obtained by solving the following problem:

$$D_c = \underset{D' \subseteq D_b,\; |D'| \geq (1-\gamma)|D_b|}{\arg\min} \sum_{(x_i, y_i) \in D'} \mathcal{L}\big(f(x_i; w), y_i\big), \qquad (2)$$
$$D_n = D_b \setminus D_c, \qquad (3)$$

where $\gamma$ is the drop rate. Then $D_c$ is directly leveraged for training, while images in $D_n$ are entered into the selection net $S$ to calculate their probability of being in-distribution via:

$$p_i = S\big(h(x_i); \theta\big), \qquad (4)$$

where $h(x_i)$ denotes the features of sample $x_i$ and $\theta$ is the parameter of $S$. Furthermore, we select samples that have high $p_i$ as in-distribution noisy images and discard the others. The in-distribution set $D_{in}$ can be obtained by:

$$D_{in} = \underset{D' \subseteq D_n,\; |D'| \geq \rho |D_n|}{\arg\max} \sum_{x_i \in D'} p_i, \qquad (5)$$

where $\rho$ is the relabeling rate. To boost the robustness of the model, we also propose a labeling net $L$, which produces pseudo labels for these selected in-distribution noisy images through:

$$\tilde{y}_i = L\big(h(x_i); \phi\big), \qquad (6)$$

where $\phi$ is the parameter of $L$. Finally, these relabeled in-distribution noisy images, along with the clean ones, are utilized for training. The parameter $w$ is updated according to the descent direction of the expected loss as in Eq. (7):

$$w^{(t+1)} = w^{(t)} - \alpha \, \nabla_w \bigg[ \frac{1}{|D_c|} \sum_{(x_i, y_i) \in D_c} \mathcal{L}\big(f(x_i; w), y_i\big) + \frac{1}{|D_{in}|} \sum_{x_i \in D_{in}} \mathcal{L}\big(f(x_i; w), \tilde{y}_i\big) \bigg] \Bigg|_{w^{(t)}}. \qquad (7)$$

$S$ and $L$ are trained using the meta set $D^{m}$. Their training mechanisms are illustrated in detail in the following sections.
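To make the per-batch procedure concrete, the following sketch implements Eqs. (2)–(7) in PyTorch. It is a minimal illustration rather than our exact implementation; the names `backbone`, `classifier`, `sel_net`, and `label_net` are placeholders for the feature extractor, classifier head, selection net $S$, and labeling net $L$, respectively.

```python
import torch
import torch.nn.functional as F

def train_step(backbone, classifier, sel_net, label_net,
               images, labels, gamma=0.35, rho=0.05):
    """One classifier update on a noisy mini-batch, following Eqs. (2)-(7)."""
    feats = backbone(images)                                   # h(x)
    logits = classifier(feats)
    losses = F.cross_entropy(logits, labels, reduction='none')

    # Eqs. (2)-(3): treat the (1 - gamma) fraction of small-loss samples as clean.
    n = images.size(0)
    order = torch.argsort(losses)
    n_clean = n - int(gamma * n)
    clean_idx, noisy_idx = order[:n_clean], order[n_clean:]

    # Eq. (4): in-distribution probability from the selection net.
    p = sel_net(feats[noisy_idx].detach()).squeeze(1)

    # Eq. (5): keep the top-rho fraction of noisy samples as in-distribution.
    n_in = int(rho * noisy_idx.numel())
    in_idx = noisy_idx[torch.argsort(p, descending=True)[:n_in]]

    # Eq. (6): pseudo labels for the in-distribution noisy samples.
    pseudo = label_net(feats[in_idx].detach()).argmax(dim=1)

    # Eq. (7): train on clean samples plus relabeled in-distribution ones.
    loss = F.cross_entropy(logits[clean_idx], labels[clean_idx])
    if n_in > 0:
        loss = loss + F.cross_entropy(logits[in_idx], pseudo)
    return loss  # the caller runs loss.backward() and optimizer.step()
```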

3.2. Selection Net

The architecture and training mechanism of our selection net $S$ are shown in Fig. 1 (b). To be specific, $S$ is an MLP (multilayer perceptron) with one hidden layer containing 256 nodes. We apply the ReLU activation function on each hidden node and utilize the Sigmoid activation function to guarantee that the output lies in the interval $[0, 1]$. It takes the image features as input and produces the probability that a sample is in-distribution. In our approach, we utilize $S$ to distinguish in- and out-of-distribution noisy images.
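A direct PyTorch rendering of this architecture is given below; the input dimension of 512 is an assumption matching the pooled features of the ResNet-34 backbone used in Section 4.2.

```python
import torch.nn as nn

class SelectionNet(nn.Module):
    """MLP with one 256-unit hidden layer; the Sigmoid keeps the output in [0, 1]."""
    def __init__(self, feat_dim=512):  # 512 assumes ResNet-34 pooled features
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, features):
        # Returns the in-distribution probability p for each feature vector.
        return self.net(features)
```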

We optimize the parameters of $S$ in a meta-learning based manner. Specifically, we first sample a mini-batch $D_b = \{(x_i, y_i)\}_{i=1}^{n}$ from the training set and formulate the classifier learning function through:

$$\hat{w}^{(t)}(\theta) = w^{(t)} - \alpha \frac{1}{n} \sum_{i=1}^{n} S\big(h(x_i); \theta\big) \, \nabla_w \mathcal{L}\big(f(x_i; w), y_i\big) \Big|_{w^{(t)}}. \qquad (8)$$

We utilize the output of $S$ as the weight of sample $x_i$. In this way, we make $\hat{w}$ a function of $\theta$. Accordingly, we draw a mini-batch $D_m = \{(x_j^{m}, y_j^{m})\}_{j=1}^{m}$ from the meta set. It is entered into the classifier net with parameter $\hat{w}^{(t)}(\theta)$ to calculate the meta loss. Then, we can update $\theta$ via:

$$\theta^{(t+1)} = \theta^{(t)} - \beta \, \nabla_\theta \frac{1}{m} \sum_{j=1}^{m} \mathcal{L}\big(f(x_j^{m}; \hat{w}^{(t)}(\theta)), y_j^{m}\big) \Big|_{\theta^{(t)}}, \qquad (9)$$

where $\beta$ denotes the learning rate of $S$. The gradient with respect to $\theta$ in Eq. (9) can be obtained by back-propagation with the following derivation:

$$\nabla_\theta \frac{1}{m} \sum_{j=1}^{m} \mathcal{L}\big(f(x_j^{m}; \hat{w}^{(t)}(\theta)), y_j^{m}\big) = -\frac{\alpha}{n} \sum_{i=1}^{n} \bigg( \frac{1}{m} \sum_{j=1}^{m} g_{ij} \bigg) \nabla_\theta S\big(h(x_i); \theta\big), \qquad (10)$$

where $g_{ij} = \nabla_{\hat{w}} \mathcal{L}\big(f(x_j^{m}; \hat{w}), y_j^{m}\big) \big|_{\hat{w}^{(t)}}^{\top} \, \nabla_w \mathcal{L}\big(f(x_i; w), y_i\big) \big|_{w^{(t)}}$. Then Eq. (9) can be rewritten as:

$$\theta^{(t+1)} = \theta^{(t)} + \frac{\alpha \beta}{n} \sum_{i=1}^{n} \bigg( \frac{1}{m} \sum_{j=1}^{m} g_{ij} \bigg) \nabla_\theta S\big(h(x_i); \theta\big) \Big|_{\theta^{(t)}}. \qquad (11)$$

In this formula, the coefficient $\frac{1}{m} \sum_{j=1}^{m} g_{ij}$ represents the similarity between the gradient of a training sample computed on the training loss and the average gradient of the meta data computed on the meta loss. Therefore, if the learning gradient of a training sample is similar to that of the meta images, it will be considered in-distribution and $S$ tends to produce a higher score for it. On the contrary, samples whose gradients differ from that of the meta set will receive a lower score. In this way, our $S$ learns to leverage the feature representation to distinguish in- and out-of-distribution samples.
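The sketch below shows one way to realize this meta-update (Eqs. (8)–(9)) with `torch.func.functional_call` (available in PyTorch 2.x). It is an illustrative approximation, not our exact code; `model.features(x)` is an assumed hook returning the pooled features $h(x)$.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call

def selection_meta_step(model, sel_net, sel_optim, x, y, x_meta, y_meta, alpha=0.01):
    """One meta-update of the selection net parameters theta (Eqs. (8)-(9))."""
    params = dict(model.named_parameters())

    # Eq. (8): virtual classifier update weighted by S(h(x); theta); keep the
    # graph so the meta gradient can flow back into sel_net's parameters.
    weights = sel_net(model.features(x).detach()).squeeze(1)
    losses = F.cross_entropy(functional_call(model, params, (x,)), y,
                             reduction='none')
    grads = torch.autograd.grad((weights * losses).mean(),
                                tuple(params.values()), create_graph=True)
    virtual = {name: p - alpha * g
               for (name, p), g in zip(params.items(), grads)}

    # Eq. (9): meta loss of the virtual classifier on the clean meta batch,
    # back-propagated through the virtual update into theta.
    meta_loss = F.cross_entropy(functional_call(model, virtual, (x_meta,)), y_meta)
    sel_optim.zero_grad()
    meta_loss.backward()
    sel_optim.step()
    model.zero_grad()  # discard the spurious gradients left on the classifier
```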

3.3. Labeling Net

Fig. 1 (c) illustrates the training mechanism of our labeling net $L$. General DNNs are composed of a series of convolutional layers as the feature extractor and a fully connected layer as the classifier. On the basis of this framework, we add an additional fully connected layer as $L$ after the last convolutional layer. It takes the image features as input and produces a pseudo label. $L$ learns from the clean meta set. Specifically, we first draw a mini-batch $D_m$ from the meta set and utilize the classifier net to extract the image features $h(x_j^{m})$ for each sample in $D_m$. Then, we update $\phi$ by:

$$\phi^{(t+1)} = \phi^{(t)} - \alpha \, \nabla_\phi \frac{1}{m} \sum_{j=1}^{m} \mathcal{L}\big(L(h(x_j^{m}); \phi), y_j^{m}\big) \Big|_{\phi^{(t)}}. \qquad (12)$$

We adopt the same learning rate $\alpha$ for the classifier network and $L$, because $L$ and the classifier head are parallel fully connected layers (see Fig. 1 (c)) that share the same input and output sizes. With this training mechanism, $L$ learns to dynamically produce labels from the image features during training.
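A minimal sketch of this update is shown below; the feature dimension of 512 and class count of 200 are placeholders (e.g., ResNet-34 features and CUB200 classes), and `backbone` is assumed to return pooled image features.

```python
import torch.nn as nn
import torch.nn.functional as F

# The labeling net is a single fully connected layer parallel to the
# classifier head of f, mapping pooled features to class logits.
label_net = nn.Linear(512, 200)

def labeling_net_step(backbone, label_net, label_optim, x_meta, y_meta):
    """Eq. (12): update phi on a clean meta mini-batch."""
    feats = backbone(x_meta).detach()  # features come from the classifier net
    loss = F.cross_entropy(label_net(feats), y_meta)
    label_optim.zero_grad()
    loss.backward()
    label_optim.step()
```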

Input: training set $D$, meta set $D^{m}$, drop rate $\gamma$, relabeling rate $\rho$, batch size $n$, epoch number $T$, and initial epoch number $T_0$.
Initialize classifier network parameter $w^{(0)}$, selection net parameter $\theta^{(0)}$, and labeling net parameter $\phi^{(0)}$.
for $t = 0, \dots, T-1$ do
        for each mini-batch do
               Sample a mini-batch $D_b$ from $D$ and $D_m$ from $D^{m}$;
               Formulate the learning function $\hat{w}^{(t)}(\theta)$ by Eq. (8);
               Update $\theta$ by Eq. (11) and $\phi$ by Eq. (12);
               if $t < T_0$ then
                      Update $w$ by Eq. (1);
               else
                      Obtain $D_c$ via Eq. (2) and $D_n$ by Eq. (3);
                      Calculate $p_i$ by Eq. (4) and $D_{in}$ via Eq. (5);
                      Generate pseudo labels $\tilde{y}_i$ for $D_{in}$ through Eq. (6);
                      Update $w$ with Eq. (7);
               end if
        end for
end for
Output: Updated parameters $w$, $\theta$, and $\phi$.
Algorithm 1: Data-driven Meta-learning Based Fine-Grained Recognition Algorithm

We directly utilize the feature extractor trained on the noisy dataset and simply leverage a fully connected layer as $L$. The aim is to avoid over-fitting and make $L$ more generalizable. If we pre-trained a DNN on the meta set as $L$, it would probably suffer from over-fitting, as only a small number of clean images are available (Li et al., 2017). As a result, it could not produce reliable pseudo labels. On the contrary, in our approach, the feature extractor learns from a large number of web training images. Hence, it generalizes better than one trained on a small dataset. In this way, our $L$ does not need to extract the image features by itself and only learns to produce labels from them, so it is less likely to suffer from over-fitting. Our approach takes full advantage of the meta set by utilizing it to train $S$ and $L$ for sample selection and loss correction, respectively. We train $S$, $L$, and the classifier net simultaneously in an end-to-end manner. With the help of $S$ and $L$, our approach has the ability to utilize in-distribution noisy images and can re-use clean samples that may be eliminated by the selection guided by training loss. To conclude, our approach combines the sample selection and loss correction methods while overcoming their drawbacks. It can robustly train the model from noisy web images in real-world datasets. The detailed steps of our proposed approach are summarized in Algorithm 1.

| Supervision | Method | Publication | BBox/Anno | Training Set | CUB200-2011 | FGVC-Aircraft | Cars-196 |
|---|---|---|---|---|---|---|---|
| Strongly | Part-Stacked CNN (Huang et al., 2016) | CVPR 2016 | ✓ | anno. | 76.60 | - | - |
| | Coarse-to-fine (Yao et al., 2016) | TIP 2016 | ✓ | anno. | 82.90 | 87.70 | - |
| | HSnet (Lam et al., 2017) | CVPR 2017 | ✓ | anno. | 87.50 | - | 93.90 |
| | Mask-CNN (Wei et al., 2018) | PR 2018 | ✓ | anno. | 85.70 | - | - |
| Weakly | Bilinear CNN (Lin et al., 2015) | ICCV 2015 | | anno. | 84.10 | 83.90 | 91.30 |
| | RA-CNN (Fu et al., 2017) | CVPR 2017 | | anno. | 85.30 | - | 92.50 |
| | Multi-attention (Zheng et al., 2017) | ICCV 2017 | | anno. | 86.50 | 89.90 | 92.80 |
| | Filter-bank (Wang et al., 2018) | CVPR 2018 | | anno. | 86.70 | 92.00 | 93.80 |
| | Parts Model (Ge et al., 2019) | CVPR 2019 | | anno. | 90.40 | - | - |
| | TASN (Zheng et al., 2019) | CVPR 2019 | | anno. | 89.10 | - | 93.80 |
| | DCL (Chen et al., 2019) | CVPR 2019 | | anno. | 87.80 | 93.00 | 94.50 |
| Webly | WSDG (Niu et al., 2015) | CVPR 2015 | | web | 70.61 | - | - |
| | Xiao et al. (Xiao et al., 2015) | CVPR 2015 | | web | 70.92 | - | - |
| Semi | Xu et al. (Xu et al., 2016) | TPAMI 2018 | | anno.+web | 84.60 | - | - |
| | Cui et al. (Cui et al., 2016) | CVPR 2016 | | anno.+web | 80.70 | - | - |
| | Niu et al. (Niu et al., 2018) | CVPR 2018 | | anno.+web | 76.47 | - | - |
| Meta | Decoupling (Malach and Shalev-Shwartz, 2017) | NeurIPS 2017 | | meta.+web | 75.04 | 81.88 | 84.78 |
| | Co-teaching (Han et al., 2018) | NeurIPS 2018 | | meta.+web | 82.62 | 82.81 | 88.78 |
| | MW-Net (Shu et al., 2019a) | NeurIPS 2019 | | meta.+web | 77.37 | 77.17 | 84.09 |
| | Our Approach | - | | meta.+web | 84.19 | 83.95 | 89.83 |

Table 1. ACA (%) performances on three benchmark fine-grained datasets. BBox/Anno (✓) indicates human annotations are utilized during training. Training Set denotes whether the training data is a manually labeled set (anno.), a small clean meta set (meta.), or collected from the web (web).

4. Experiments

4.1. Datasets and Evaluation Metric

We evaluate our approach on three commonly used benchmark fine-grained datasets: CUB200-2011 (Wah et al., 2011), FGVC-Aircraft (Maji et al., 2013), and Cars-196 (Krause et al., 2013). Average Classification Accuracy (ACA) is taken as the evaluation metric, which is widely used for evaluating the performance of fine-grained visual classification.

4.2. Implementation Details

We leverage the web images collected in (Zhang et al., 2020b) as the noisy training set and directly adopt the testing data of CUB200-2011, FGVC-Aircraft, and Cars-196 as the test sets. The small clean meta set is built by randomly sampling 10 images per category from the benchmark training sets of CUB200-2011, FGVC-Aircraft, and Cars-196. Accordingly, we have 18388, 13503, and 21448 noisy web training images as well as 2000, 1000, and 1960 small clean meta images for the CUB200, FGVC-Aircraft, and Cars-196 datasets, respectively.
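For reproduction purposes, a meta set of this kind can be drawn with a simple per-class sampler; the sketch below assumes the dataset is available as a list of (path, label) pairs and is not tied to our data pipeline.

```python
import random
from collections import defaultdict

def sample_meta_set(samples, per_class=10, seed=0):
    """Randomly pick `per_class` clean images per category as the meta set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))
    meta = []
    for items in by_class.values():
        meta.extend(rng.sample(items, min(per_class, len(items))))
    return meta
```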

We utilize a pre-trained ResNet-34 (He et al., 2016) model as our backbone network and select the drop rate $\gamma$ from the values {0.15, 0.2, 0.25, 0.3, 0.35, 0.4}. We ultimately set $\gamma$ to 0.35, 0.20, and 0.25 as the default values on the CUB200, FGVC-Aircraft, and Cars-196 datasets, respectively. For the other parameters, we set the relabeling rate $\rho = 0.05$ for the CUB200 and FGVC-Aircraft datasets, and $\rho = 0.15$ for the Cars-196 dataset. The initial epoch number $T_0$ is set to 5 on all three datasets. We use an SGD optimizer with momentum and train our model for 100 epochs with the batch size set to 60. The learning rate $\alpha$ is set to 0.01 and decayed with cosine annealing (Loshchilov and Hutter, 2016), and $\beta$ is fixed at 0.01.
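In PyTorch, this schedule corresponds roughly to the setup below; the momentum value is not stated above and is shown only as a common default.

```python
import torch
from torchvision.models import resnet34

model = resnet34(pretrained=True)  # ImageNet-pretrained backbone (API varies by torchvision version)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # momentum assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one epoch of Algorithm 1 with batch size 60 ...
    scheduler.step()
```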

4.3. Baseline Methods

To illustrate the superiority of our approach, our baselines contain the following five groups of state-of-the-art fine-grained methods.
1) Strongly supervised methods: Part-Stacked CNN (Huang et al., 2016), Coarse-to-fine (Yao et al., 2016), HSnet (Lam et al., 2017), and Mask-CNN (Wei et al., 2018).
2) Weakly supervised methods: Bilinear CNN (Lin et al., 2015), RA-CNN (Fu et al., 2017), Filter-bank (Wang et al., 2018), Multi-attention (Zheng et al., 2017), Parts Model (Ge et al., 2019), TASN (Zheng et al., 2019), and DCL (Chen et al., 2019).
3) Webly supervised methods: WSDG (Niu et al., 2015), and Xiao et al. (Xiao et al., 2015).
4) Semi-supervised methods: Xu et al. (Xu et al., 2016), Niu et al. (Niu et al., 2018), and Cui et al. (Cui et al., 2016).
5) Meta data based methods: Decoupling (Malach and Shalev-Shwartz, 2017), Co-teaching (Han et al., 2018), and MW-Net (Shu et al., 2019a).
It should be noted that all the meta data based methods use the same backbone network (ResNet-34) as ours. Their implementations are also the same as ours, including the learning rate, optimizer, and batch size. Moreover, we set the same drop rate as ours for Co-teaching. Experiments are conducted on one NVIDIA V100 GPU card.

4.4. Experimental Results and Analysis

Table 1 presents the fine-grained ACA results of various approaches on the benchmark datasets. As demonstrated in Table 1, with the guidance of a small clean meta set, our proposed approach shows significant improvements over web-supervised approaches (Niu et al., 2015; Xiao et al., 2015). Compared with semi-supervised methods that utilize a large number of manually labeled images (Niu et al., 2018) or even additional human annotations (Xu et al., 2016; Cui et al., 2016), our approach achieves close or superior performance on the CUB200 dataset. Moreover, our approach remarkably surpasses the other meta data based methods on all benchmark datasets.

| Backbone | Method | Performance (%) |
|---|---|---|
| ResNet-18 | Discarding (Baseline) | 75.89 |
| | No-relabeling | 74.09 |
| | Distillation (Li et al., 2017) | 74.13 |
| | Self-correction (Song et al., 2019) | 75.41 |
| | Ours | 77.48 |

Table 3. ACA (%) comparison of different loss correction methods.

| Number of Images per Category | Performance (%) |
|---|---|
| 50 | 67.35 |
| 60 | 71.06 |
| 70 | 72.94 |
| 80 | 75.63 |
| 90 | 77.48 |

Table 2. ACA (%) comparison of different training set sizes.
Figure 2. The test accuracies of the baselines and ours on the CUB200 dataset (a), and our test accuracies on the CUB200, Cars-196, and FGVC-Aircraft datasets (b).
Figure 3. The training loss (a) and in-distribution probability $p$ (b) of clean, in-distribution, and out-of-distribution noisy images in the initial epochs. Bars indicate the standard deviation.

Fig. 2 (a) presents the test accuracies of the baselines and ours on the CUB200 dataset. By observing Fig. 2 (a), we can notice that our approach shows better performance, training speed, and stability than the meta data based baselines. MW-Net (Shu et al., 2019a) requires a small clean meta set; however, it does not take full advantage of it. Compared with MW-Net (Shu et al., 2019a), our approach makes better use of the clean meta set by utilizing it for sample selection and loss correction simultaneously. As a result, our approach outperforms MW-Net significantly on all three benchmark datasets.

From Fig. 2 (a), we can also observe that Decoupling (Malach and Shalev-Shwartz, 2017) shows the worst performance among all meta data based methods. The reason is that it discards a large number of web training images with which the model could be boosted. We notice that the drop rate of Decoupling climbs as training proceeds, with an average value of 77%. In this situation, many clean images are discarded. To overcome this drawback, our approach leverages a fixed drop rate to ensure that most of the images are utilized for training. Co-teaching (Han et al., 2018), which shares the same drop rate as ours, performs best among the baselines but still falls behind our approach, because ours can relabel and reuse in-distribution noisy images to boost the training of the model.

Fig. 2 (b) presents the test accuracies vs. the number of epochs on all three benchmark datasets. By observing Fig. 2 (b), we can find that the training processes on CUB200 and Cars-196 are fast and stable. This phenomenon demonstrates the superiority of our approach. However, the test accuracy on FGVC-Aircraft fluctuates during training. This may result from the fact that FGVC-Aircraft has the smallest clean meta set among the benchmark datasets, and a smaller number of trusted images makes the selection results and pseudo labels less stable. Although the training process has some fluctuations, the test accuracy becomes stable at the end of training, with a higher value than that of the baselines (see Table 1). From this result, we can conclude that our proposed approach still works even when the clean meta set is very small.

5. Ablation Studies

In this section, we further demonstrate how our approach works. To save time and computing resources, we conduct experiments on the CUB200 dataset with a relatively small pre-trained network, ResNet-18 (He et al., 2016), as our backbone. To better demonstrate the effectiveness of our approach, we remove the small clean meta set from the training set, which means that the meta set is only utilized to train the selection net $S$ and the labeling net $L$, while the classifier net is updated only with noisy web images.

| Dataset | Training Set | Performance (%) | Improvement |
|---|---|---|---|
| CUB200-2011 | anno. | 79.46 | 6.35 |
| | anno.+web | 85.81 | |
| FGVC-Aircraft | anno. | 82.21 | 7.32 |
| | anno.+web | 89.53 | |
| Cars-196 | anno. | 88.20 | 3.36 |
| | anno.+web | 91.56 | |

Table 4. ACA (%) performances and improvements of data augmentation. Anno. and web denote that the dataset is manually labeled and collected from the web, respectively.

| Backbone | Method | Performance (%) | Improvement |
|---|---|---|---|
| ResNet-18 | Baseline | 68.59 | 8.89 |
| | Ours | 77.48 | |
| ResNet-34 | Baseline | 74.06 | 6.51 |
| | Ours | 80.57 | |
| ResNet-50 | Baseline | 74.92 | 6.84 |
| | Ours | 81.76 | |

Table 5. ACA (%) performances and improvements of ResNet-18, ResNet-34, and ResNet-50. Baseline denotes that the network is directly trained on the web training set.
Figure 4. ACA (%) performance comparison of different meta set sizes.
Figure 5. The parameter sensitivities of the drop rate $\gamma$ (a) and relabeling rate $\rho$ (b).

5.1. Effectiveness of Sample Selection

In this experiment, we compare the in-distribution probability $p$ and the cross-entropy training loss in the process of sample selection. We first manually add 10 clean images, 10 in-distribution noisy images, and 10 out-of-distribution noisy images to the training set. Then we record their $p$ and loss values in the initial epochs during training. The experimental results are shown in Fig. 3.

By observing Fig. 3 (a), we can notice that the loss values of in- and out-of-distribution noisy images are similar and much higher than those of clean images. This phenomenon indicates that utilizing loss values can split clean images from the training set but cannot distinguish between in- and out-of-distribution noisy samples. From Fig. 3 (b), we can observe that clean and in-distribution noisy images share a close value of $p$, while out-of-distribution noisy samples show a lower $p$. By leveraging the value of $p$, we can distinguish the in-distribution ones among noisy images. From Fig. 3 (a) and (b), we can conclude that utilizing the loss and $p$ simultaneously can efficiently identify clean, in-distribution, and out-of-distribution noisy samples. With the ability to distinguish in- and out-of-distribution images, our approach can further utilize in-distribution noisy images by relabeling them.

5.2. Influence of Different Training Set Sizes

We investigate the impact of data scale by changing the number of web images per category on CUB200. Specifically, we collect 50, 60, 70, 80, and 90 images from the web for each category. As shown in Table 2, the ACA performance improves steadily with more web training images. Hence, leveraging web images for FGVC is a promising research direction, as web data is rich and easy to obtain.

5.3. Effectiveness of Loss Correction

To demonstrate the superiority of our labeling net $L$, we replace $L$ in our framework with other loss correction approaches for comparison. These loss correction methods are as follows: 1) Discarding (Baseline): dropping all noisy images and conducting no loss correction; 2) No-relabeling: utilizing in-distribution noisy images for training without correcting their labels; 3) Distillation (Li et al., 2017): pre-training a ResNet-18 model on the meta set for relabeling; 4) Self-correction (Song et al., 2019): utilizing the predictions of the classifier network as pseudo labels for in-distribution noisy images. In this experiment, we fix the value of $\gamma$ at 0.35 and $\rho$ at 0.05.

The experimental results are shown in Table 3. From Table 3, we can observe that the model suffers from label noise if the labels of in-distribution noise are not corrected (No-relabeling). Utilizing these noisy images for training is worse than simply discarding them (Baseline). By relabeling in-distribution noisy images, Distillation (Li et al., 2017) and Self-correction (Song et al., 2019) slightly outperform No-relabeling. Nevertheless, they still show worse performance than the baseline. Their unsatisfactory performance results from correction error. Distillation (Li et al., 2017) is prone to over-fitting because the clean meta set only contains 2000 images; the model pre-trained on the small meta set has poor generalization performance and cannot produce reliable pseudo labels (its test accuracy is only 53.33%). Since fine-grained recognition is challenging, the classifier network tends to give wrong predictions during training, and Self-correction then produces wrong pseudo labels. Our approach overcomes their drawbacks by training $L$ with the guidance of the clean meta set. It produces reliable pseudo labels and remarkably outperforms the other loss correction approaches.

Figure 6. Sample selection results on the noisy web training set of (Zhang et al., 2020b). Clean and in-distribution noisy images are similar, while out-of-distribution noisy samples are diverse and totally different from images in the CUB200 dataset.

5.4. Influence of Data Augmentation

To demonstrate the advantage of leveraging web images, we train a ResNet-18 model on the three benchmark datasets with our web images as data augmentation. Specifically, we utilize the intact manually labeled dataset as the validation set. In addition, we only adopt our algorithm on the web images, while the manually labeled ones are directly leveraged for training. The results are given in Table 4. From Table 4, we can observe that leveraging web images significantly improves the performance across different datasets. The improvements on CUB200, FGVC-Aircraft, and Cars-196 are 6.35%, 7.32%, and 3.36%, respectively. This result also indicates that leveraging web images is an effective way to improve the robustness of FGVC models.

5.5. Influence of Different Backbones

We investigate the applicability of our approach by conducting experiments with different backbone networks on the CUB200 dataset. The experimental results are demonstrated in Table 5. From Table 5, it can be observed that our approach shows remarkable improvements across different backbones. Guided by some meta images, networks can learn from web data more efficiently.

| $S$ Architecture | Performance (%) |
|---|---|
| Baseline | 75.89 |
| FC | 76.75 |
| MLP-100 | 76.92 |
| MLP-256 | 77.48 |
| MLP-1024 | 76.58 |

Table 6. The performances of different $S$ structures. Baseline indicates discarding all noisy images. FC denotes a single fully connected layer, and the numbers denote the number of hidden nodes in the MLP.

5.6. Influence of Different Meta Set Sizes

In this experiment, we change the number of meta images per category to investigate the influence of the meta set size. The experimental results are illustrated in Fig. 4. From Fig. 4, we can find that unless the meta set is extremely small (e.g., each category only has 4 images or fewer), our approach can outperform the baseline by relabeling in-distribution noisy images. Most importantly, the performance remains roughly stable once the image number increases to 6 or more. This phenomenon indicates that our approach is robust and that a small clean meta set is sufficient to ensure reliability and performance.

5.7. Influence of Different Architectures

In this experiment, we compare the ACA performances of different $S$ architectures: a single fully connected layer and MLP networks with one hidden layer containing 100, 256, and 1024 nodes. Table 6 illustrates the experimental results. By observing Table 6, we can find that all architectures outperform the baseline and achieve close test accuracies. Even utilizing the simplest fully connected layer as $S$ works well. This result indicates that our approach is robust and does not require a complex network architecture for $S$.

5.8. Parameter Sensitivities

For the parameter sensitivity analysis, we study the drop rate $\gamma$ and relabeling rate $\rho$ on the three datasets. As illustrated in Fig. 5 (a), each dataset has its own optimal drop rate. The best value is 0.35 for CUB200, 0.2 for Cars-196, and 0.25 for FGVC-Aircraft. Before $\gamma$ reaches the optimal value, the ACA performance rises steadily as $\gamma$ increases. The improvement may come from discarding more noisy samples. If $\gamma$ exceeds the optimal value, the performance obviously declines. One possible explanation is that too many instances are discarded and the network cannot get sufficient training data. Fig. 5 (b) presents the influence of the relabeling rate $\rho$. By observing Fig. 5 (b), we can find that the optimal relabeling rates for CUB200, Cars-196, and FGVC-Aircraft are 0.05, 0.15, and 0.05, respectively. On the Cars-196 dataset, the ACA performance climbs as $\rho$ increases to the optimal value, which probably results from leveraging more in-distribution noisy images for training. Conversely, the performance drops as $\rho$ increases beyond the optimal value on the CUB200 and FGVC-Aircraft datasets. The reason may be that some out-of-distribution samples are utilized and misguide the model as $\rho$ rises.

5.9. Visualization

To intuitively demonstrate our sample selection ability, we visualize the selection results on the noisy web training set of (Zhang et al., 2020b) in Fig. 6. From Fig. 6, we can observe that the three kinds of samples are clearly separated. In-distribution noisy images are similar to clean ones but are probably mislabeled. Some out-of-distribution samples are related to birds but totally different from images in the manually labeled CUB200 dataset, e.g., books or drawings. This selection result demonstrates that our approach is practical and able to deal with noisy real-world datasets. Moreover, our approach can be utilized to refurbish the web training set by discarding harmful out-of-distribution noisy images.

6. Conclusion

In this paper, we presented a data-driven meta-set based method to learn from noisy web images for fine-grained visual classification. Our motivation is to combine the "sample selection" and "loss correction" methods and to utilize in-distribution noisy web images to boost training. Comprehensive experiments on three real-world datasets demonstrate that our approach is substantially superior to state-of-the-art meta-set based methods for fine-grained visual classification.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61976116, 61702265, 61932020), National Key R&D Program of China (No. 2018AAA0102001), Fundamental Research Funds for the Central Universities (No. 30920021135), and Natural Science Foundation of Jiangsu Province (No. BK20170856).

References

  • D. Arpit, S. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. (2017) A closer look at memorization in deep networks. In International Conference on Machine Learning, pp. 233–242.
  • S. Branson, G. Van Horn, S. Belongie, and P. Perona (2014) Bird species categorization using pose normalized deep convolutional nets. In British Machine Vision Conference, pp. 1–14.
  • T. Chen, J. Zhang, G. Xie, Y. Yao, X. Huang, and Z. Tang (2020) Classification constrained discriminator for domain adaptive semantic segmentation. In IEEE International Conference on Multimedia and Expo, pp. 1–6.
  • Y. Chen, Y. Bai, W. Zhang, and T. Mei (2019) Destruction and construction learning for fine-grained image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5157–5166.
  • Y. Cui, F. Zhou, Y. Lin, and S. Belongie (2016) Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1153–1162.
  • J. Fu, H. Zheng, and T. Mei (2017) Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4438–4446.
  • W. Ge, X. Lin, and Y. Yu (2019) Weakly supervised complementary parts models for fine-grained image classification from the bottom up. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3034–3043.
  • J. Goldberger and E. Ben-Reuven (2016) Training deep neural-networks using a noise adaptation layer.
  • B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems, pp. 8527–8537.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • X. He and Y. Peng (2017) Fine-grained image classification via combining vision and language. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5994–6002.
  • D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel (2018) Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in Neural Information Processing Systems, pp. 10456–10465.
  • X. Hua, F. Shen, J. Zhang, and Z. Tang (2016) A domain robust approach for image dataset construction. In ACM International Conference on Multimedia, pp. 212–216.
  • S. Huang, Z. Xu, D. Tao, and Y. Zhang (2016) Part-stacked CNN for fine-grained visual categorization. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1173–1182.
  • D. Korsch, P. Bodesheim, and J. Denzler (2019) Classification-specific parts for improving fine-grained visual categorization. arXiv preprint arXiv:1909.07075.
  • J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei (2016) The unreasonable effectiveness of noisy data for fine-grained recognition. In European Conference on Computer Vision, pp. 301–320.
  • J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In IEEE International Conference on Computer Vision, pp. 554–561.
  • M. Lam, B. Mahasseni, and S. Todorovic (2017) Fine-grained recognition as HSnet search for informative image parts. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2520–2529.
  • Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L. Li (2017) Learning from noisy labels with distillation. In IEEE International Conference on Computer Vision, pp. 1910–1918.
  • Z. Li, J. Tang, and T. Mei (2018) Deep collaborative embedding for social image understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (9), pp. 2070–2083.
  • Z. Li, J. Tang, L. Zhang, and J. Yang (2020) Weakly-supervised semantic guided hashing for social image retrieval. International Journal of Computer Vision.
  • T. Lin, A. RoyChowdhury, and S. Maji (2015) Bilinear CNN models for fine-grained visual recognition. In IEEE International Conference on Computer Vision, pp. 1449–1457.
  • I. Loshchilov and F. Hutter (2016) SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations, pp. 1–16.
  • J. Lu, H. Liu, Y. Yao, S. Tao, Z. Tang, and J. Lu (2020) HSI road: a hyper spectral image dataset for road segmentation. In IEEE International Conference on Multimedia and Expo, pp. 1–6.
  • H. Luo, G. Lin, Z. Liu, F. Liu, Z. Tang, and Y. Yao (2019) SegEQA: video segmentation based visual attention for embodied question answering. In IEEE International Conference on Computer Vision, pp. 9667–9676.
  • S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi (2013) Fine-grained visual classification of aircraft. arXiv preprint.
  • E. Malach and S. Shalev-Shwartz (2017) Decoupling "when to update" from "how to update". In Advances in Neural Information Processing Systems, pp. 960–970.
  • L. Niu, W. Li, and D. Xu (2015) Visual recognition by learning from web data: a weakly supervised domain generalization approach. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2774–2783.
  • L. Niu, A. Veeraraghavan, and A. Sabharwal (2018) Webly supervised learning meets zero-shot learning: a hybrid approach for fine-grained classification. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7171–7180.
  • Y. Peng, X. He, and J. Zhao (2017) Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing 27 (3), pp. 1487–1500.
  • S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich (2014) Training deep neural networks on noisy labels with bootstrapping. arXiv preprint.
  • M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. In International Conference on Machine Learning.
  • J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng (2019a) Meta-weight-net: learning an explicit mapping for sample weighting. In Advances in Neural Information Processing Systems, pp. 1917–1928.
  • X. Shu, J. Tang, G. Qi, W. Liu, and J. Yang (2019b) Hierarchical long short-term concurrent memory for human interaction recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • H. Song, M. Kim, and J. Lee (2019) SELFIE: refurbishing unclean samples for robust deep learning. In International Conference on Machine Learning, pp. 5907–5915.
  • Z. Sun, F. Shen, L. Liu, L. Wang, et al. (2019) Dynamically visual disambiguation of keyword-based image search. In International Joint Conference on Artificial Intelligence, pp. 996–1002.
  • J. Tang, Z. Li, H. Lai, L. Zhang, S. Yan, et al. (2017) Personalized age progression with bi-level aging dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 905–917.
  • G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie (2015) Building a bird recognition app and large scale dataset with citizen scientists: the fine print in fine-grained dataset collection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 595–604.
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 dataset.
  • Y. Wang, V. I. Morariu, and L. S. Davis (2018) Learning a discriminative filter bank within a CNN for fine-grained recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4148–4157.
  • X. Wei, C. Xie, J. Wu, and C. Shen (2018) Mask-CNN: localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition 76, pp. 704–714.
  • T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang (2015) Learning from massive noisy labeled data for image classification. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2691–2699.
  • G. Xie, L. Liu, X. Jin, F. Zhu, Z. Zhang, J. Qin, Y. Yao, and L. Shao (2019) Attentive region embedding network for zero-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9384–9393.
  • G. Xie, L. Liu, F. Zhu, F. Zhao, Z. Zhang, Y. Yao, J. Qin, and L. Shao (2020) Region graph embedding network for zero-shot learning. In European Conference on Computer Vision.
  • S. Xie, T. Yang, X. Wang, and Y. Lin (2015) Hyper-class augmented and regularized deep learning for fine-grained image classification. In IEEE Winter Conference on Applications of Computer Vision, pp. 2645–2654.
  • Z. Xu, S. Huang, Y. Zhang, and D. Tao (2015) Augmenting strong supervision using web data for fine-grained categorization. In IEEE International Conference on Computer Vision, pp. 2524–2532.
  • Z. Xu, S. Huang, Y. Zhang, and D. Tao (2016) Webly-supervised fine-grained visual categorization via deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (5), pp. 1100–1113.
  • H. Yao, S. Zhang, Y. Zhang, J. Li, and Q. Tian (2016) Coarse-to-fine description for fine-grained visual categorization. IEEE Transactions on Image Processing 25 (10), pp. 4858–4872.
  • Y. Yao, F. Shen, G. Xie, L. Liu, F. Zhu, J. Zhang, and H. T. Shen (2020) Exploiting web images for multi-output classification: from category to subcategories. IEEE Transactions on Neural Networks and Learning Systems 31 (7), pp. 2348–2360.
  • Y. Yao, F. Shen, J. Zhang, L. Liu, Z. Tang, and L. Shao (2018a) Extracting multiple visual senses for web learning. IEEE Transactions on Multimedia 21 (1), pp. 184–196.
  • Y. Yao, F. Shen, J. Zhang, L. Liu, Z. Tang, and L. Shao (2018b) Extracting privileged information for enhancing classifier learning. IEEE Transactions on Image Processing 28 (1), pp. 436–450.
  • Y. Yao, J. Zhang, F. Shen, X. Hua, J. Xu, and Z. Tang (2017) Exploiting web images for dataset construction: a domain robust approach. IEEE Transactions on Multimedia 19 (8), pp. 1771–1784.
  • Y. Yao, J. Zhang, F. Shen, L. Liu, F. Zhu, D. Zhang, and H. T. Shen (2019) Towards automatic construction of diverse, high-quality image datasets. IEEE Transactions on Knowledge and Data Engineering 32 (6), pp. 1199–1211.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016a) Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, pp. 1–15.
  • C. Zhang, Y. Yao, J. Zhang, J. Chen, et al. (2020a) Web-supervised network for fine-grained visual classification. In IEEE International Conference on Multimedia and Expo, pp. 1–6.
  • C. Zhang, Y. Yao, H. Liu, G. Xie, X. Shu, T. Zhou, Z. Zhang, F. Shen, and Z. Tang (2020b) Web-supervised network with softly update-drop training for fine-grained visual classification. In AAAI Conference on Artificial Intelligence, pp. 12781–12788.
  • J. Zhang, F. Shen, X. Hua, J. Xu, and Z. Tang (2016b) Automatic image dataset construction with multiple textual metadata. In IEEE International Conference on Multimedia and Expo, pp. 1–6.
  • J. Zhang, F. Shen, X. Hua, J. Xu, and Z. Tang (2017) A new web-supervised method for image dataset constructions. Neurocomputing 236, pp. 23–31.
  • J. Zhang, F. Shen, W. Yang, X. Hua, and Z. Tang (2018a) Extracting privileged information from untagged corpora for classifier learning. In International Joint Conference on Artificial Intelligence, pp. 1085–1091.
  • J. Zhang, F. Shen, W. Yang, P. Huang, and Z. Tang (2018b) Discovering and distinguishing multiple visual senses for polysemous words. In AAAI Conference on Artificial Intelligence, pp. 523–530.
  • N. Zhang, J. Donahue, R. Girshick, and T. Darrell (2014) Part-based R-CNNs for fine-grained category detection. In European Conference on Computer Vision, pp. 834–849.
  • X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian (2016c) Picking deep filter responses for fine-grained image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1134–1142.
  • Y. Zhang, X. Wei, J. Wu, J. Cai, J. Lu, V. Nguyen, and M. N. Do (2016d) Weakly supervised fine-grained categorization with part-based image representation. IEEE Transactions on Image Processing 25 (4), pp. 1713–1725.
  • H. Zheng, J. Fu, T. Mei, and J. Luo (2017) Learning multi-attention convolutional neural network for fine-grained image recognition. In IEEE International Conference on Computer Vision, pp. 5209–5217.
  • H. Zheng, J. Fu, Z. Zha, and J. Luo (2019) Looking for the devil in the details: learning trilinear attention sampling network for fine-grained image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5012–5021.