Fine-grained recognition aims at discriminating sub-categories that belong to the same general category, e.g., recognizing different kinds of birds [38, 2], dogs, and cars. Different from general category recognition, fine-grained sub-categories often share the same parts (e.g., all birds have wings, legs, heads, etc.), and usually can only be distinguished by subtle differences in the texture and color of these parts (e.g., only the breast color counts when discriminating some similar birds). Although the advances of Convolutional Neural Networks (CNNs) have fueled remarkable progress in general image recognition [18, 32, 17, 39], fine-grained recognition remains challenging, as the discriminative details are too subtle to discern.
Existing deep-learning-based fine-grained recognition approaches usually focus on developing better models for part localization and representation. Typical strategies include: 1) part-based methods that first localize parts, crop and amplify the attended parts, and concatenate part features for recognition [24, 43, 4, 44]; 2) attention-based methods that use visual attention mechanisms to find the most discriminative regions of fine-grained images [40, 13, 47]; 3) feature-based methods, such as bilinear pooling or trilinear pooling, for better representation. However, we argue that for fine-grained recognition, the most critical challenge arises from the limited training samples, since collecting labels for fine-grained samples often requires expert-level domain knowledge, which is difficult to scale. As a result, existing deep models easily overfit to the small-scale training data, especially when deployed with more complex modules.
In this paper, we promote fine-grained recognition by enriching the training samples at low cost, and propose a simple but effective data augmentation method to alleviate overfitting and improve model generalization. Our method, termed Attribute Mix, aims at substantially enlarging the training data by mixing semantically meaningful attribute features from two images. The motivation is that attribute-level features are the key factors for discriminating different sub-categories, and are transferable since all sub-categories share the same attributes. Toward this goal, we propose an automatic attribute learning approach to discover attribute features. This is achieved by training a multi-hot attribute-level deep classification network that iteratively masks out the most discriminative parts, encouraging the network to focus on diverse parts of an object. The newly generated images, accompanied by attribute labels mixed proportionally to the extent to which the two attributes fuse, can greatly enrich the training samples while maintaining their discriminative semantic meaning, thereby alleviating overfitting and improving model generalization.
Our method shares a similar spirit with MixUp and CutMix, which both mix two samples by interpolating images and labels. The difference is that MixUp and CutMix randomly mix images or patches from two images without considering their semantic meaning. As a result, these two mixing operations often fuse images at non-discriminative regions, which introduces noise and makes training unstable. An example is shown in Fig. 1. In comparison, Attribute Mix intentionally fuses two images at the attribute level, which results in more semantically meaningful images and helps improve model generalization.
Benefiting from the discovered attributes, we are able to introduce more training samples at the attribute level using only generic labels. Here, a generic label is a single bit of supervision indicating whether an object belongs to a general category, e.g., whether it is a bird or not. We claim that attribute features can be seamlessly transferred from the generic domain to the fine-grained domain without knowing the objects' fine-grained sub-category labels. This is achieved by a standard semi-supervised learning strategy that mines samples at the attribute level. The mined samples intrinsically carry mixed attribute labels, and we term this method, which mines attributes from the generic domain, Attribute Mix+. Note that although fine-grained labels are difficult to obtain, it is much easier to obtain the general label of an object. In this way, we can conveniently scale up the fine-grained training samples by mining attributes from the generic domain for better performance.
Our proposed data augmentation strategy is a general, low-cost framework, and can be combined with most state-of-the-art fine-grained recognition methods to further improve performance. Experiments conducted on several widely used fine-grained benchmarks demonstrate the effectiveness of our method. In particular, without any bells and whistles, using ResNet-101 as the backbone, we achieve accuracies of , and on CUB-200-2011, FGVC-Aircraft, and Stanford Cars, respectively, which surpass the corresponding baselines by a large margin without sacrificing speed compared with the baseline. We hope that our research on attribute-based data augmentation can offer useful guidelines for fine-grained recognition.
To sum up, this paper makes the following contributions:
We propose Attribute Mix, a data augmentation strategy to alleviate the overfitting for fine-grained recognition.
We propose Attribute Mix+, which mines fine-grained samples at the attribute level without needing the specific sub-category labels of the mined samples. Attribute Mix+ conveniently scales up fine-grained training for better performance.
We evaluate our methods on three challenging datasets (CUB-200-2011, FGVC-Aircraft, Stanford Cars), and achieve superior performance over the state-of-the-art methods.
2 Related Work
We briefly review work on fine-grained recognition, as well as recent techniques in data augmentation, which are most related to our work.
2.1 Fine-grained Recognition
Fine-grained recognition has been studied for several years. Early works focus on leveraging extra part and bounding box annotations to localize the discriminative regions of an object [4, 43, 44, 24]. Later, weakly supervised localization methods [1, 33, 20, 29, 50, 31, 45] were proposed to localize objects with only image-level annotations. Zhou et al. proposed to localize objects by picking out class-specific feature maps. Zhang et al. proposed to use adversarial training to locate the integral object and achieved superior localization results.
On the other hand, powerful features have been provided by better CNN architectures. Lin et al. proposed a bilinear structure that computes pairwise feature interactions using two independent CNNs, and it turns out that higher-order feature interactions can make features highly discriminative. To model the subtle differences between two fine-grained sub-categories, attention mechanisms [40, 47, 13] and metric learning [30, 7] are often used. Besides, Zhang et al. proposed to unify CNNs with spatially weighted representations of Fisher Vectors, which achieves high performance on CUB-200-2011. Although promising performance has been achieved, these methods all come at the expense of higher computational cost and are difficult to deploy on low-end devices.
A few works rely on external data to facilitate recognition. Cui et al. proposed to use the Earth Mover's Distance to estimate domain similarity and to transfer knowledge from a source domain that is similar to the target domain. Krause et al. collected millions of tagged images from the Web and utilized the Web data through transfer learning. However, both methods make use of a large number of class-specific labels for transfer learning. In this paper, we demonstrate that for fine-grained recognition, transferring attribute features is a powerful proxy for improving recognition accuracy at low cost, without knowing the class-specific labels of the source domain.
2.2 Data Augmentation
Data augmentation can greatly alleviate overfitting when training deep networks. Simple transformations such as horizontal flipping, color-space augmentation, and random cropping are widely used in recognition tasks to improve generalization. Recently, automatic data augmentation techniques, e.g., AutoAugment, have been proposed to search for better augmentation strategies among a large pool of candidates. Differently, Mixup combines two samples linearly at the pixel level, with the target of the synthetic image being the linear combination of the one-hot labels. Though not meaningful for human perception, Mixup has proved surprisingly effective for recognition. Following Mixup, several variants [37, 16] have appeared, as well as a recent effort named CutMix, which combines Mixup and Cutout by cutting and pasting patches. However, all these methods inevitably introduce unreasonable noise, since they operate on random patches of an image without considering semantic meaning. Our method is similar to these methods in mixing two samples for data augmentation. Differently, the mixing operation is performed only around semantically meaningful regions, which enables more stable training and benefits generalization.
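For concreteness, the two mixing schemes discussed above can be sketched in NumPy. This is a minimal illustration following the published formulations of Mixup and CutMix; the function names and the rectangle-sampling details are our own.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Mixup: linearly interpolate two images and their one-hot labels."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)                 # combination ratio
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam

def cutmix(x1, y1, x2, y2, alpha=1.0, rng=None):
    """CutMix: paste a random rectangle of x2 onto x1; mix labels by area."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    h, w = x1.shape[:2]
    # Rectangle whose area is roughly (1 - lam) of the image.
    rh, rw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    t, b = max(cy - rh // 2, 0), min(cy + rh // 2, h)
    l, r = max(cx - rw // 2, 0), min(cx + rw // 2, w)
    x = x1.copy()
    x[t:b, l:r] = x2[t:b, l:r]
    lam_adj = 1.0 - (b - t) * (r - l) / (h * w)  # actual area ratio kept from x1
    y = lam_adj * y1 + (1.0 - lam_adj) * y2
    return x, y
```

Note that both operations pick the mixed region (the whole image, or a random rectangle) without looking at image content, which is exactly the limitation Attribute Mix addresses.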
3 Method
In this section, we describe our proposed attribute-based data augmentation strategy in detail. As shown in Fig. 2, the proposed method consists of three core modules: 1) automatic attribute mining, which discovers attributes with only image-level labels; 2) Attribute Mix data augmentation, which mixes attribute features from any two images to generate new images; and 3) Attribute Mix+, which enriches the training samples by mining images from the same generic domain at the attribute level.
3.1 Attribute Mining
The attribute-level features are the core of the subsequent data augmentation operation. We first elaborate how to obtain attribute-level features with only image-level labels. Denote $(x, y)$ as a training image and its corresponding one-hot label, with $y \in \{0, 1\}^C$, where $C$ is the number of fine-grained sub-categories. Without loss of generality, assuming that all fine-grained sub-categories share $K$ attributes, we simply convert the class-level labels to more detailed, attribute-level labels for attribute discovery, as shown in Fig. 3. Specifically, the one-hot label $y$ of image $x$ is extended to a multi-hot label $y^a \in \{0, 1\}^{KC}$ with exactly $K$ non-zero bits, each non-zero bit regarded as one attribute corresponding to a specific sub-category. As shown in Fig. 2, for a typical CNN, we simply remove all the fully connected layers and add a convolutional layer to produce feature maps with $KC$ channels, where every $K$ adjacent channels correspond to the attributes of a certain sub-category. These feature maps are fed into a GAP (Global Average Pooling) layer to aggregate attribute-level features for classification. The multiple attribute mining proceeds as follows:
Training the multi-hot attribute-level classification network with the original images $x$ and the attribute-level multi-hot labels $y^a$.
For an image $x$ with label $y$, picking out the feature map at the channel corresponding to its sub-category's first attribute, generating the attention map according to the activations, and upsampling it to the original image size; thresholding this map gives the most discriminative region mask $M_1$ of the image.
$M_1$ is used to erase the original image to obtain the erased image $x'$, and the corresponding multi-hot label changes accordingly, with the bit of the mined attribute set to zero.
Using $x'$ as a new training sample, and repeating the above three steps to obtain masks $M_i$, $i = 2, \ldots, K$, for all the remaining attributes.
Following the above procedure, we can train an attribute-level classification network automatically and obtain a series of masks $M_i$, which correspond to different attributes of an object. Fig. 4 shows an example of what the attribute mining procedure learns. These attributes coarsely correspond to different parts of an object, e.g., a bird's head, wings, and tail.
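The mask-and-erase loop above can be sketched as follows. This is a simplified NumPy illustration of the CAM-style procedure; the helper names, the nearest-neighbor upsampling, and the fixed threshold are our own simplifications (the feature-map spatial size is assumed to divide the image size evenly).

```python
import numpy as np

def discriminative_mask(feat_maps, channel, img_hw, thresh=0.5):
    """Turn one attribute channel's activation map into a binary region mask.

    feat_maps: (KC, h, w) activations; channel: index of the attribute
    channel for the image's sub-category; img_hw: (H, W) of the image.
    """
    a = feat_maps[channel]
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)   # normalize to [0, 1]
    H, W = img_hw
    h, w = a.shape
    a_up = np.kron(a, np.ones((H // h, W // w)))     # nearest-neighbor upsample
    return (a_up >= thresh).astype(np.float32)       # binary mask M_i

def erase(img, mask, fill=0.0):
    """Erase the masked region so the next round must attend elsewhere."""
    return img * (1.0 - mask[..., None]) + fill * mask[..., None]
```

In the full procedure, each round trains on the erased images, extracts the next mask with `discriminative_mask`, erases again, and repeats until all K attribute masks are collected.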
3.2 Attribute Mix
After obtaining the attribute regions, we introduce a simple attribute-level data augmentation strategy to facilitate fine-grained training. The proposed Attribute Mix operation constructs synthetic examples by intentionally mixing the corresponding attribute regions from any two images. Specifically, given two samples $(x_i, y_i)$ and $(x_j, y_j)$ with $i \neq j$, and randomly picked attribute masks $M_i$, $M_j$, the generated training sample $\tilde{x}$ is obtained by:

$$\tilde{x} = (\mathbf{1} - \tilde{M}_j) \odot x_i + \lambda \tilde{M}_j \odot x_i + (1 - \lambda) \tilde{M}_j \odot x_j, \qquad (1)$$

where $\tilde{M}_j$ is the binary mask transformed from $M_j$, with its center aligned with $M_i$, denoting the region that needs to be mixed; $\mathbf{1}$ is a binary mask filled with ones, and $\odot$ is the element-wise multiplication operation. Like Mixup, the combination ratio $\lambda$ between the two regions is sampled from the beta distribution $\mathrm{Beta}(\alpha, \alpha)$. In all our experiments, we set $\alpha = 1$, meaning that $\lambda$ is sampled from the uniform distribution $U(0, 1)$. An illustration of the Attribute Mix operation is shown in Fig. 1, together with some comparison results generated by the Mixup and CutMix operations. Compared with Mixup and CutMix, which inevitably introduce some meaningless samples (e.g., random patches, background noise), Attribute Mix focuses only on the foreground attribute regions and is more suitable for fine-grained categorization.
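A minimal sketch of the Attribute Mix operation in NumPy follows. The mask is assumed to be already translated so that it aligns with the chosen attribute region of the first image (the alignment step is omitted), and the label-mixing rule, weighting the second label by how much of its attribute survives, is our own reading of "mixed proportionally".

```python
import numpy as np

def attribute_mix(x_i, y_i, x_j, y_j, m_j, alpha=1.0, rng=None):
    """Mix x_j's attribute region into x_i, interpolating inside the mask.

    m_j: binary (H, W) mask of an attribute region of x_j, assumed to be
    already center-aligned with the chosen attribute of x_i.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    m = m_j[..., None]
    # Outside the mask keep x_i; inside, interpolate the two images (Eq. 1).
    x = (1.0 - m) * x_i + lam * m * x_i + (1.0 - lam) * m * x_j
    # Assumed label rule: weight y_j by the mixed-in fraction of its attribute.
    w_j = (1.0 - lam) * m_j.mean()
    y = (1.0 - w_j) * y_i + w_j * y_j
    return x, y
```

With `alpha=1`, the combination ratio is uniform on (0, 1), matching the setting used in the experiments.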
Though our Attribute Mix operation is more semantically meaningful, the generated virtual samples still suffer from a large domain gap with the original, natural images. As a result, memorizing these samples would deteriorate the model's generalization. To address this issue, we introduce a time-decay learning strategy to limit the probability of applying the Attribute Mix operation. This is achieved by controlling the mixing ratio $\lambda$ in Eq. (1): we introduce a variable $\delta$, which increases from 0 to 1 as training proceeds, and require $\lambda > \delta$. In this way, $\lambda$ is sampled from the uniform distribution, and only when $\lambda$ is larger than $\delta$ are the generated samples used for training. As training goes on, the mixing operations between two images decay, and finally degenerate to using the original images. In the experimental section, we validate the effectiveness of this strategy in improving model generalization.
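The time-decay gate can be sketched in a few lines. The linear schedule for the gating variable is an assumption for illustration; the point is only that the pass rate of sampled ratios shrinks to zero as training proceeds.

```python
import numpy as np

def sample_mix_ratio(step, total_steps, alpha=1.0, rng=None):
    """Time-decayed Attribute Mix gate.

    delta grows from 0 to 1 over training (linear schedule assumed here);
    a sampled lambda is used only if it exceeds delta, so mixing gradually
    switches off. Returns lambda, or None to use the raw image.
    """
    rng = rng or np.random.default_rng()
    delta = step / total_steps          # increases from 0 to 1
    lam = rng.beta(alpha, alpha)        # uniform on (0, 1) when alpha = 1
    return lam if lam > delta else None
```

Early in training almost every batch is mixed; by the final steps the gate rejects essentially all samples and training degenerates to the original images, as described above.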
3.3 Attribute Mix+
In principle, the attribute features are shared among images from the same super-category, regardless of their specific sub-category labels. This conveniently enables us to expand the training samples at the attribute level to images from the generic domain. In this section, we enrich the training samples from another perspective, i.e., by transferring attributes from images with only generic labels. This is achieved by generating soft, attribute-level labels via a standard semi-supervised learning strategy over a large number of images with low-cost, general labels. Since attributes are shared among all sub-categories, images from the generic domain need not belong to the same sub-categories as the target domain.
Using the model trained with the Attribute Mix strategy, we conduct inference over images from the generic domain and produce attribute-level probabilities via a softmax layer. Specifically, denoting the model output as $z \in \mathbb{R}^{KC}$, we reshape the output to a $K \times C$ matrix, where each row corresponds to the same attribute among different sub-categories. The softmax operation is conducted over the row dimension to obtain the probability $p$:

$$p_{ij} = \frac{\exp(z_{ij} / T)}{\sum_{j'=1}^{C} \exp(z_{ij'} / T)},$$

where $T$ is a temperature parameter that controls the smoothness of the probability distribution.
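The reshape-and-softmax step can be sketched as follows. The attribute-major memory layout (all sub-categories for attribute 1, then attribute 2, and so on) is an assumption made for this illustration.

```python
import numpy as np

def attribute_soft_labels(logits, K, C, T=2.0):
    """Reshape a flat (K*C,) output into (K, C) and softmax each row.

    Each row is one attribute across the C sub-categories; the temperature
    T smooths the resulting distribution (layout assumed attribute-major).
    """
    z = logits.reshape(K, C) / T
    z = z - z.max(axis=1, keepdims=True)       # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)    # (K, C), each row sums to 1
```

A larger T flattens each row toward the uniform distribution, which is exactly what the entropy-ranking step below is designed to detect and filter.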
In semi-supervised learning, it is a common underlying assumption that a classifier's decision boundary should not pass through high-density regions of the marginal data distribution. Similarly, in our experiments, the collected data with generic labels inevitably contain noise that would probably hurt the performance, and some samples have soft labels so smooth that they cannot provide any valuable attribute information. To address this issue, we propose an entropy-ranking strategy to select images adaptively. Specifically, given an image $x$ from the generic domain with soft attribute label $p$, its entropy is calculated as follows:

$$H(p) = -\sum_{i=1}^{K} \sum_{j=1}^{C} p_{ij} \log p_{ij},$$

where $p_{ij}$ denotes the probability that image $x$ contains attribute $i$ of fine-grained sub-category $j$. $H(p)$ is large if the attribute distribution is smooth, and reaches its maximum when all attributes obtain the same probability, in which case the sample carries no valuable information for fine-grained training. Based on this property, samples whose entropy is higher than a threshold are filtered out. In the ablation study, we validate the effectiveness of this strategy for achieving stable results, especially in the presence of noisy images.
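The entropy-ranking filter can be sketched directly from the definition above (the function name and list-based interface are our own):

```python
import numpy as np

def entropy_filter(soft_labels, threshold):
    """Keep generic-domain samples whose attribute-label entropy is below
    the threshold; near-uniform (high-entropy) soft labels carry no usable
    attribute information and are discarded."""
    keep = []
    for p in soft_labels:                 # p: (K, C) soft attribute label
        q = np.clip(p, 1e-12, 1.0)        # avoid log(0)
        H = -(q * np.log(q)).sum()        # entropy over all attribute bits
        if H <= threshold:
            keep.append(p)
    return keep
```

A sharply peaked soft label passes the filter, while a near-uniform one, whose entropy approaches the maximum log(KC), is rejected.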
4 Experiments
4.1 Datasets and Implementation Details
Datasets. The empirical evaluation is performed on three widely used fine-grained benchmarks: Caltech-UCSD Birds-200-2011, Stanford Cars, and FGVC-Aircraft, which belong to three generic domains: birds, cars, and aircraft, respectively. Each dataset has specific statistical properties that are crucial for recognition performance. CUB-200-2011 is the most widely used fine-grained dataset, containing 11,788 images spanning 200 sub-species. Stanford Cars consists of 16,185 images of 196 classes produced by different manufacturers, while FGVC-Aircraft consists of 10,000 images of 100 species. By current standards, these are all small-scale datasets. We use the default training/test split in our experiments, which gives around 30 training examples per class for birds, 40 per category for cars, and approximately 66 per category for aircraft.
Implementation details. All experiments are based on a ResNet-101 backbone. We first construct baseline models on the three datasets for the following comparisons. During network training, the input images are randomly cropped to pixels after being resized to pixels, and randomly flipped. For Stanford Cars, color jittering is used as extra data augmentation. We train the models for epochs using Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001. The learning rate is set to , and decays by a factor of every epochs. During inference, the original image is center-cropped and resized to . Benefiting from the powerful features, our baseline models achieve pleasing performance on these datasets. In the following, we validate the effectiveness of the proposed method even over such strong baselines.
For multi-hot attribute-level classification, a standard binary cross-entropy loss is used. We increase the number of training epochs to after introducing Attribute Mixed samples, since the augmented data need more rounds to converge. When mining attributes from the generic domain, we use the model pretrained with Attribute Mix to perform inference on data from the generic domain and produce soft attribute-level labels. During inference, the multi-hot attribute-level classification network outputs the predicted scores of the attributes, and we simply combine the predicted scores of the attributes of each sub-category.
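The final score combination can be sketched as follows. We show averaging of the per-attribute scores; the text only says the scores are "combined", and summing is equivalent up to a constant when ranking classes. The attribute-major layout is again an assumption.

```python
import numpy as np

def class_scores(attr_logits, K, C):
    """Fold K attribute predictions per sub-category into one class score
    by averaging the attribute logits of each class (layout assumed
    attribute-major: all C classes for attribute 1, then attribute 2, ...)."""
    return attr_logits.reshape(K, C).mean(axis=0)   # shape (C,)
```

The predicted sub-category is then simply the argmax of the combined score vector.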
4.2 Collecting Data From Generic Domain
To mine attribute information from the generic domain, we need to collect extra data with generic labels. Corresponding to the target fine-grained datasets, we collect data from three generic domains: birds, aircraft, and cars. We search the corresponding category data via keywords on Flickr (https://www.flickr.com). Specifically, for birds, we simply choose keywords from the 200 sub-categories of CUB-200-2011 and the sub-categories of NABirds, and use these keywords to crawl images from the Web. For simplicity, we do not perform any keyword or image filtering, as our method is robust to label noise to some extent. As a result, we obtain about images from sub-categories. For aircraft, we assemble a list of 400 species from Wikipedia, which contains "Airbus A-320", "Beriev A-40", "Bede BD-5", etc., and obtain around images. For cars, we combine the species from Stanford Cars and the species from CompCars to get a list of species as keywords for data collection, resulting in around images. We retain only the generic labels such as "bird", "car", and "aircraft", and discard the specific labels used for crawling in the following experiments. Note that these images are directly crawled from the Web without any filtering, and thus inevitably contain noisy images that do not belong to these three generic domains.
Cross-dataset redundancy. One concern when training with auxiliary datasets is possible redundancy between the crawled data and the test set. Even though we do not have fine-grained labels on the auxiliary dataset, we still conduct a duplicate check to quantify the extent to which test images are contained in our crawled images. Following prior work, we choose GIST descriptor matching, which has been shown to have excellent performance at near-duplicate image detection in large image collections. We remove those images that may appear at test time, and the remaining numbers of images for birds, cars, and aircraft are , and , respectively.
4.3 Ablation Study
In this section, we investigate some parameters which are important for recognition performance, as well as some in-depth analysis over the robustness of the image mining at attribute level. Unless otherwise specified, all experiments are conducted on CUB-200-2011.
Table 1: Dataset | Number of attributes | Acc.
Effects of the number of attributes. Here we inspect the recognition performance with respect to the number of attributes used during Attribute Mining. The performance for different choices is shown in Table 1. If the number of attributes is set too small, the model cannot mine adequate attributes of an object and the improvement is marginal; the smallest setting denotes the baseline without multiple attribute mining. For larger values, the model risks including background clutter, which makes optimization difficult. We achieve the best performance when the number of attributes is set to . In the following, we keep this parameter fixed for all experiments for simplicity.
Impact of the hyperparameter $\alpha$. The hyperparameter $\alpha$ in Eq. (1) plays an important role during mixing, as it controls the strength of interpolation between attributes from two training samples. Here we try different choices of $\alpha$. The performance for different values is shown in the left plot of Fig. 5, and the best performance is achieved when $\alpha$ is set to . For simplicity, we keep this parameter fixed for all the following experiments where needed.
Effects of adaptive Attribute Mix. We introduce the time-decay strategy to alleviate overfitting on the augmented samples: the probability of applying Attribute Mix decays from to following the curve. To validate the effectiveness of this strategy, we keep all hyperparameters constant and train the model with Attribute Mix but without the time-decay strategy. The comparison on CUB-200-2011 is shown in Table 2: Attribute Mix with the time-decay learning strategy achieves a higher accuracy of , surpassing Attribute Mix without it by points.
Comparison with image-level mining. In Section 3.3, images from the generic domain are used to mine attribute-level features. To validate the advantages of attribute-level features, we compare against traditional semi-supervised learning using only image-level labels, where image-level pseudo labels leverage the information of the unlabeled data directly, without considering attribute information. For a fair comparison, both methods use a ResNet-101 model, and Attribute Mix is not applied during training. We use the model's predictions at the attribute level and at the image level separately to generate soft labels for the images from the generic domain. As shown in Table 3, our proposed attribute-level features achieve a much higher accuracy of , surpassing the image-level result by points. This is not a small gain considering the high baseline and the difficulty of the CUB-200-2011 dataset.
Effects of the threshold for entropy ranking. When leveraging data from the generic domain, we introduce entropy ranking to select samples whose shared attributes contribute most to fine-grained recognition. The entropy-ranking mechanism investigates the correlations between the mined images from the generic domain and the fine-grained domain, and is robust to noisy images that probably do not belong to the same generic domain. To inspect the effectiveness of entropy ranking at length, we intentionally introduce extra images with labels different from the generic labels, and test the performance of our method in this setting. Specifically, we choose the external datasets PASCAL VOC 2007 and 2012 as noisy images, both of which contain 20 object classes. These are multi-label datasets, and the number of samples that include birds is 1,377, only a small ratio (around ) of the whole. Overall, this can be treated as adding noisy images (around ) to the generic-domain dataset. We evaluate our method with these noisy data, and the results with different thresholds are shown in the right plot of Fig. 5. With entropy ranking, images without valuable information are filtered out, and the best result is comparable with the best result obtained without intentionally added noisy labels. When the threshold is set to , the entropy-ranking mechanism filters out most of the noisy samples in VOC, with only samples retained. The advantages of the ranking mechanism are thus obvious.
Transfer Learning | Inception-v3 | 89.6
Transfer Learning | Inception-ResNet-v2 SE | 89.3
4.4 Comparisons with State-of-the-arts
We now compare our proposed method with state-of-the-art works on the above-mentioned fine-grained datasets. Table 4 shows the comparison results on CUB-200-2011, where '*' denotes methods using external data. For a fair comparison, we choose recent works that use a backbone similar to ours. MAMC introduces complex attention modules to model the subtle visual differences. The Pairwise Confusion procedure is designed to reduce overfitting by intentionally introducing confusion in the activations. Bilinear interaction, with its high computational complexity, is used to learn fine-grained image representations. Our proposed Attribute Mix achieves a superior accuracy of on CUB-200-2011 without any complicated modules or external data. Compared with other data-augmentation methods, Attribute Mix outperforms Mixup and CutMix by at least points, which further demonstrates that Attribute Mix is more suitable for fine-grained categorization.
We also compare our proposed Attribute Mix with work using external data. Transfer learning introduces the external labeled fine-grained dataset iNat and leverages this large-scale dataset through transfer learning. In contrast, our proposed Attribute Mix uses only data with low-cost generic labels, and its result surpasses transfer learning with the labeled fine-grained dataset by points.
Our method also exhibits good performance on Stanford Cars and FGVC-Aircraft. The comparison results on these two fine-grained datasets are shown in Table 5 and Table 6, respectively. Our proposed Attribute Mix achieves accuracies of and on these two datasets without external data, bringing improvements of and points over the strong baselines. When using data with generic labels, the performance of Attribute Mix is further boosted to and . As far as we know, these accuracies are the best results on the two datasets. Attribute Mix thus achieves state-of-the-art performance on all these fine-grained datasets. Most importantly, our proposed method does not increase the inference budget compared with the baseline model.
Transfer Learning | Inception-v3 | 93.1
Transfer Learning | Inception-ResNet-v2 SE | 93.5
Transfer Learning | Inception-v3 | 89.6
Transfer Learning | Inception-ResNet-v2 SE | 90.7
5 Conclusion
This paper presented a general data augmentation framework for fine-grained recognition. Our proposed method, named Attribute Mix, conducts data augmentation by mixing two images at the attribute level, and can greatly improve performance without increasing the inference budget. Furthermore, based on the principle that attribute-level features can be seamlessly transferred from the generic domain to the fine-grained domain regardless of their specific labels, we enrich the training samples with attribute-level labels using images from the generic domain at low labelling cost, and further boost the performance. Our proposed method is a general, low-cost framework for data augmentation, and can be combined with most state-of-the-art fine-grained recognition methods to further improve performance. Experiments conducted on several widely used fine-grained benchmarks demonstrate the effectiveness of our proposed method.
-  Berg, T., Liu, J., Lee, S.W., Alexander, M.L., Jacobs, D.W., Belhumeur, P.N.: Birdsnap: Large-scale fine-grained visual categorization of birds. 2014 IEEE Conference on Computer Vision and Pattern Recognition pp. 2019–2026 (2014)
-  Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.A.: Mixmatch: A holistic approach to semi-supervised learning. In: Advances in Neural Information Processing Systems. pp. 5050–5060 (2019)
-  Branson, S., Van Horn, G., Belongie, S., Perona, P.: Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952 (2014)
-  Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: Learning augmentation strategies from data. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 113–123 (2019)
-  Cui, Y., Song, Y., Sun, C., Howard, A., Belongie, S.: Large scale fine-grained categorization and domain-specific transfer learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4109–4118 (2018)
-  Cui, Y., Zhou, F., Lin, Y., Belongie, S.: Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1153–1162 (2016)
-  Cui, Y., Zhou, F., Wang, J., Liu, X., Lin, Y., Belongie, S.: Kernel pooling for convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2921–2930 (2017)
-  DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
-  Dubey, A., Gupta, O., Guo, P., Raskar, R., Farrell, R., Naik, N.: Pairwise confusion for fine-grained visual classification. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 70–86 (2018)
-  Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
-  Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge 2007 (voc2007) results (2007)
-  Fu, J., Zheng, H., Mei, T.: Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4438–4446 (2017)
-  Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 580–587 (2014)
-  Grandvalet, Y., Bengio, Y.: Semi-supervised learning by entropy minimization. In: Advances in neural information processing systems. pp. 529–536 (2005)
-  Guo, H., Mao, Y., Zhang, R.: Mixup as locally linear out-of-manifold regularization. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 3714–3722 (2019)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
-  Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7132–7141 (2018)
-  Khosla, A., Jayadevaprakash, N., Yao, B., Fei-Fei, L.: Novel dataset for fine-grained image categorization. In: First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition. Colorado Springs, CO (June 2011)
-  Kim, D., Cho, D., Yoo, D., So Kweon, I.: Two-phase learning for weakly supervised object localization. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3534–3543 (2017)
-  Krause, J., Sapp, B., Howard, A., Zhou, H., Toshev, A., Duerig, T., Philbin, J., Fei-Fei, L.: The unreasonable effectiveness of noisy data for fine-grained recognition. In: European Conference on Computer Vision. pp. 301–320. Springer (2016)
-  Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13). Sydney, Australia (2013)
-  Lee, D.H.: Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on challenges in representation learning, ICML. vol. 3, p. 2 (2013)
-  Lin, D., Shen, X., Lu, C., Jia, J.: Deep lac: Deep localization, alignment and classification for fine-grained recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1666–1674 (2015)
-  Lin, T.Y., RoyChowdhury, A., Maji, S.: Bilinear cnn models for fine-grained visual recognition. In: Proceedings of the IEEE international conference on computer vision. pp. 1449–1457 (2015)
-  Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)
-  Miyato, T., Maeda, S.i., Koyama, M., Ishii, S.: Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41(8), 1979–1993 (2018)
-  Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. International journal of computer vision 42(3), 145–175 (2001)
-  Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Is object localization for free?-weakly-supervised learning with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 685–694 (2015)
-  Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 815–823 (2015)
-  Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 618–626 (2017)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
-  Singh, K.K., Lee, Y.J.: Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 3544–3553. IEEE (2017)
-  Sun, M., Yuan, Y., Zhou, F., Ding, E.: Multi-attention multi-class constraint for fine-grained image recognition. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 805–821 (2018)
-  Van Horn, G., Branson, S., Farrell, R., Haber, S., Barry, J., Ipeirotis, P., Perona, P., Belongie, S.: Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 595–604 (2015)
-  Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., Belongie, S.: The inaturalist species classification and detection dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8769–8778 (2018)
-  Verma, V., Lamb, A., Beckham, C., Najafi, A., Mitliagkas, I., Courville, A., Lopez-Paz, D., Bengio, Y.: Manifold mixup: Better representations by interpolating hidden states. arXiv preprint arXiv:1806.05236 (2018)
-  Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The caltech-ucsd birds-200-2011 dataset (2011)
-  Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X.: Residual attention network for image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3156–3164 (2017)
-  Xiao, T., Xu, Y., Yang, K., Zhang, J., Peng, Y., Zhang, Z.: The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 842–850 (2015)
-  Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: Regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 6023–6032 (2019)
-  Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
-  Zhang, N., Donahue, J., Girshick, R., Darrell, T.: Part-based r-cnns for fine-grained category detection. In: European conference on computer vision. pp. 834–849. Springer (2014)
-  Zhang, N., Shelhamer, E., Gao, Y., Darrell, T.: Fine-grained pose prediction, normalization, and recognition. arXiv preprint arXiv:1511.07063 (2015)
-  Zhang, X., Wei, Y., Feng, J., Yang, Y., Huang, T.S.: Adversarial complementary learning for weakly supervised object localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1325–1334 (2018)
-  Zhang, X., Xiong, H., Zhou, W., Lin, W., Tian, Q.: Picking deep filter responses for fine-grained image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1134–1142 (2016)
-  Zheng, H., Fu, J., Mei, T., Luo, J.: Learning multi-attention convolutional neural network for fine-grained image recognition. In: Proceedings of the IEEE international conference on computer vision. pp. 5209–5217 (2017)
-  Zheng, H., Fu, J., Zha, Z.J., Luo, J.: Learning deep bilinear transformation for fine-grained image representation. In: Advances in Neural Information Processing Systems. pp. 4279–4288 (2019)
-  Zheng, H., Fu, J., Zha, Z.J., Luo, J.: Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5012–5021 (2019)
-  Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2921–2929 (2016)