A Systematic Evaluation: Fine-Grained CNN vs. Traditional CNN Classifiers

by Saeed Anwar, et al.
Australian National University

To make the best use of minute and subtle differences, fine-grained classifiers collect information about inter-class variations. The task is very challenging because of the small differences in color, viewpoint, and structure between entities of the same class. Classification becomes harder still because the viewpoint differences between an entity and other classes can resemble the differences within its own class. In this work, we investigate the performance of the landmark general CNN classifiers, which presented top-notch results on large-scale classification datasets, on fine-grained datasets and compare them against state-of-the-art fine-grained classifiers. We pose two specific questions: (i) Do the general CNN classifiers achieve comparable results to fine-grained classifiers? (ii) Do general CNN classifiers require any specific information to improve upon the fine-grained ones? Throughout this work, we train the general CNN classifiers without introducing any aspect that is specific to fine-grained datasets. We show an extensive evaluation on six datasets to determine whether the fine-grained classifiers are able to elevate the baseline in their experiments.



I Introduction

Fine-grained visual classification refers to the task of distinguishing the categories of the same class. Fine-grained classification differs from traditional classification in that the former models intra-class variance while the latter is about inter-class differences. Examples of naturally occurring fine-grained classes include birds [1, 2], dogs [3], flowers [4], vegetables [5], and plants [6], while human-made categories include aeroplanes [7], cars [8], and food [9].

Fine-grained classification is helpful in numerous computer vision and image processing applications such as image captioning [10], machine teaching [11], and instance segmentation [12].

Fine-grained visual classification is a challenging problem because the differences between species of the same class are minute and subtle, e.g., a crow and a raven, whereas in traditional classification the difference between classes is quite visible, e.g., a lion and an elephant. Fine-grained classification of species or objects of any category is a herculean task for human beings and usually requires extensive domain knowledge to identify the species or objects correctly.

Fig. 1: The difference between classes (inter-class variation) is limited for various classes. Panels, top row: Malamute, Husky, Eskimo; bottom row: Crow, Raven, Jackdaw.

Fig. 2: The intra-class variation is usually high due to pose, lighting, and color.

As mentioned earlier, fine-grained classification in image space must contend with high intra-class variance and low inter-class variance. We provide a few sample images from the dog and bird datasets in Figure 1 to highlight the difficulty of the problem. The examples in the figure show images with the same viewpoint and roughly similar colors. Although the visual variation between them is very limited, all of these belong to different dog and bird categories. In Figure 2, we provide more examples of the same categories. Here, the differences in viewpoint and color are prominent. The visual variation is more significant than in the images in Figure 1, yet these belong to the same class.

Many approaches have been proposed to tackle the problem of fine-grained classification; for example, earlier works converged on part detection to model the intra-class variations. Next, algorithms exploited three-dimensional representations to handle multiple poses and viewpoints and achieve state-of-the-art results. Recently, with the advent of CNNs, most methods exploit the modeling capacity of CNNs as a component or as a whole.

This paper aims to investigate the capability of traditional CNN networks compared to specially designed fine-grained CNN classifiers. We strive to answer the question of whether current general CNN classifiers can achieve comparable performance to fine-grained ones. To show the competitiveness, we employ several fine-grained datasets and report top-1 accuracy for both classifier types. These experiments provide a proper place for general classifiers in fine-grained performance charts and will serve as baselines in future comparisons for fine-grained classification problems.
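Top-1 accuracy, the metric reported throughout, is the fraction of test images whose highest-scoring class matches the ground-truth label. A minimal Python sketch of that definition follows; the function name `top1_accuracy` and the toy scores are illustrative assumptions and are not taken from the experiments.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of samples whose arg-max class equals the true label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score in this row (ties go to the first index).
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Toy example: hypothetical class scores for three images and three classes.
toy_scores = [
    [0.1, 0.7, 0.2],  # arg-max is class 1
    [0.8, 0.1, 0.1],  # arg-max is class 0
    [0.3, 0.3, 0.4],  # arg-max is class 2
]
toy_labels = [1, 0, 1]
print(top1_accuracy(toy_scores, toy_labels))
```

In the toy example two of the three predictions match their labels, so the function returns two thirds.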

II Related Works

Fine-grained visual classification is a vital and well-studied problem. Its aim is to differentiate between subclasses of the same category, as opposed to the traditional classification problem, where discriminative features are learned to distinguish between different classes. Some of the challenges in this domain are the following: (i) the categories are highly correlated, i.e., there are only small differences and a small inter-category variance with which to discriminate between subcategories; (ii) the intra-category variation can be significant due to different viewpoints and poses. Many algorithms, such as [13, 14, 15, 16, 17, 18, 19], have been presented to achieve the desired results. In this section, we highlight recent approaches that are closest to our work. Fine-grained visual categorization (FGVC) research can be divided into three main branches, which are reviewed in the following paragraphs.

Part-based FGVC algorithms [20, 21, 22, 23, 24, 25] rely on the distinguishing features of the objects to improve the accuracy of visual recognition. These FGVC methods [26, 27] aim to learn the distinct features present in different parts of the object, e.g., the differences in the beak and tail of bird species. Similarly, the part-based approaches normalize the variation due to poses and viewpoints. Many works [28, 29, 1] assume the availability of bounding boxes at the object level and the part level in all images during both training and testing. To achieve higher accuracy, [22, 30, 31] employed both object-level and part-level annotations. These assumptions restrict the applicability of the algorithms to larger datasets. A reasonable alternative setting would be the availability of a bounding box around the object of interest. Recently, Chai et al. [21] applied simultaneous segmentation and detection to enhance the performance of segmentation as well as object part localization. Similarly, a supervised method is proposed in [16] that locates the training images similar to a test image using KNN. The object part locations from the selected training images are regressed to the test image.

Succeeding algorithms take advantage of the annotated data during the training phase while requiring no such knowledge during the testing phase. These supervised methods learn from both object-level and part-level annotations in the training phase only. A related approach is furnished in [32], where only object-level annotations are given during training while no supervision is provided at the object-part level. Similarly, the Spatial Transformer Network (STCNN) [33] handles data representation and outputs the location of vital regions. Furthermore, recent approaches focus on removing the limitations of previous works, aiming for conditions where information about the location of the object parts is required in neither the training nor the testing phase. These FGVC methods are suitable for deployment on a large scale and help advance research in this direction.

Recently, Xiao et al. [25] presented two attention models to learn appropriate patches for a particular object and determine the discriminative object parts using a deep CNN. The fundamental idea is to cluster the last CNN feature maps into groups. The object patches and object parts are obtained from the activations of these clustered feature maps. The method of [25] needs the model to be trained on the category of interest, while we only require a generally trained CNN. Similarly, DTRAM [34] learns to end the attention process for each image after a fixed number of steps. A number of methods have been proposed to take advantage of object parts; the most popular one is the deformable part model (DPM) [35], which learns the constellation relative to the bounding box with Support Vector Machines (SVM). Simon et al. [36] improved upon [37] and localized the parts using the constellation provided by DPM [35]. The Navigator-Teacher-Scrutinizer Network (NTSNet) [38] uses informative regions in images without employing any kind of annotations. Another teacher-student network, recently proposed as the Trilinear Attention Sampling Network (TASN) [63], is composed of a trilinear attention module, an attention-based sampler, and a feature distiller.

Current state-of-the-art fine-grained visual categorization methods [39, 24] avoid the use of bounding boxes during the testing and training phases. Zhang et al. [24] and Lin et al. [39] used a two-stage network for object and object-part detection and classification, employing R-CNN and Bilinear-CNN, respectively. Part-Stacked CNN [18] adopts the same two-stage strategy as [39, 24]; the difference lies in stacking the object parts at the end for classification. Fu et al. [40] proposed the multiple-scale RACNN to acquire distinguishing attention and region feature representations. Moreover, HIHCA [41] incorporated higher-order hierarchical convolutional activations via a kernel scheme.

Distance metric learning [42, 43, 44, 45] is an alternative to part-based algorithms and aims to cluster data points/objects of the same category together while moving different categories away from each other. Bromley et al. [45] trained Siamese networks using deep metrics for signature verification and thereby set the trend in this direction. Recently, Qian et al. [42] employed a multi-stage framework that accepts pre-computed feature maps and learns the distance metric for classification. The pre-computed features can be extracted from DeCAF [43], as these features are discriminative and can be used in many classification tasks. Dubey et al. [46] employ pairwise confusion (PC) via traditional classifiers.

Many works [47, 32, 25, 24] utilized the feature representations of a CNN, which have also been employed in many other tasks [48, 49, 50]. A CNN captures global information directly, as opposed to traditional descriptors, which capture local information and require manual engineering to encode a global representation. Destruction and Construction Learning (DCL) [51] takes advantage of a standard classification network and emphasizes discriminative local details. The model then reconstructs the semantic correlation among local regions. Zeiler & Fergus [49] illustrated the reconstruction of the original image from the activations of the fifth max-pooling layer. Max-pooling ensures invariance to small-scale translation and rotation; however, robustness to larger-scale deformations might be achieved by global spatial information. To capture global information, Gong et al. [52] combined the features from fully connected layers using VLAD pooling. Similarly, Cimpoi et al. [53] pooled the features from convolutional layers instead of fully connected layers for texture recognition, based on the idea that convolutional layers are transferable and not domain specific. Following in the footsteps of [53, 52], PDFR [17] encoded the CNN filter responses, employing a picking strategy combined with Fisher Vectors. However, considering feature encoding as an isolated element is not an optimal choice for convolutional neural networks.

Recently, several approaches have integrated features from different layers of the same CNN. The intuition behind feature integration is to take advantage of the global semantic information captured by fully connected layers and the instance-level information preserved by convolutional layers [54]. Long et al. [55] merged the features from intermediate and high-level convolutional activations in their convolutional network to exploit both low-level details and high-level semantics for image segmentation. Similarly, for localization and segmentation, Hariharan et al. [56] concatenated the feature maps of convolutional layers at a pixel as a vector to form a descriptor. Likewise, for edge detection, Xie & Tu [57] added several feature maps from the lower convolutional layers to guide the CNN and predict edges at different scales.

Fig. 3: Basic building blocks of the VGG [58], ResNet [59], and DenseNet [60] architectures.

III Traditional Networks

To make the paper self-contained, we briefly describe the basic building blocks of modern state-of-the-art traditional CNN architectures. These architectures can be broadly categorized into plain networks, residual networks, and densely connected networks. We review the most prominent and pioneering traditional networks in each category and then adapt these models for the fine-grained classification task. The three architectures we investigate are VGG [58], ResNet [59], and DenseNet [60].

III-A Plain Network

Pioneering CNN architectures such as AlexNet [61] and VGG [58] follow a single path, i.e., without any skip connections. The success of AlexNet [61] inspired VGG [58]. VGG relies on small convolutional filters because a sequence of smaller filters covers the same receptive field as a single larger one. For example, four stacked 3×3 convolutional layers have the same receptive field as two 5×5 convolutional layers in sequence, while the stack of smaller filters has fewer parameters than the larger ones. The basic building block of the VGG [58] architecture is shown in Figure 3(a).
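The receptive-field comparison between stacks of 3×3 and 5×5 filters can be checked with a few lines of arithmetic. The sketch below assumes stride-1 convolutions, ignores biases, and uses a channel width of 64 for every layer; these are simplifying assumptions made for illustration, and the function names are hypothetical.

```python
def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def stacked_weight_count(kernel_size: int, num_layers: int, channels: int) -> int:
    """Weights (biases ignored) when every layer maps `channels` to `channels`."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 64  # assumed channel width, chosen only for illustration

rf_small = stacked_receptive_field(3, 4)   # four 3x3 layers
rf_large = stacked_receptive_field(5, 2)   # two 5x5 layers
params_small = stacked_weight_count(3, 4, channels)
params_large = stacked_weight_count(5, 2, channels)

print(rf_small, rf_large)          # 9 9
print(params_small, params_large)  # 147456 204800
```

Under these assumptions both stacks see a 9×9 window, and the four 3×3 layers use 36C² weights against 50C² for the two 5×5 layers, where C is the channel width.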

VGG [58] has many variants; we use the 19-layer network, which has shown promising results on ImageNet. As mentioned earlier, the block structure of VGG is plain (without any skip connections), and the number of feature channels increases from 64 to 512.

                                No. of Images
Dataset         Classes    Train     Val      Test
NABirds [2]     555        23,929    -        24,633
Dogs [3]        120        12,000    -        8,580
CUB [1]         200        5,994     -        5,794
Aircraft [7]    100        3,334     3,333    3,333
Cars [8]        196        8,144     -        8,041
Flowers [4]     102        2,040     -        6,149
TABLE I: Details of the six fine-grained visual categorization datasets used for evaluation.

III-B Residual Network

To solve the vanishing gradient problem, residual networks employ skip connections known as identity shortcuts, as shown in Figure 3(b). The pioneering work in this direction is ResNet [59].

The identity shortcuts help to propagate the gradient signal back without being diminished. The identity shortcuts theoretically “skip” over all layers and reach the initial layers of the network, learning the task at hand. Because the features are summed at the end of each module, ResNet [59] learns only an offset and therefore does not need to learn the full features. The identity shortcuts allow for successful and robust training of much deeper architectures than previously possible. Because of its successful classification results, we compare two variants of ResNet (ResNet50 and ResNet152) with fine-grained classifiers.
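The idea that a residual module learns only an offset can be illustrated without any deep learning library. The following sketch treats a feature vector as a plain Python list and the residual branch as an arbitrary function F, so that the module output is x + F(x); the branch functions `zero_offset` and `small_offset` are hypothetical stand-ins for learned convolutional layers, not part of ResNet [59].

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return x + F(x): the identity shortcut is added element-wise."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x for an identity shortcut")
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_offset(x: Vector) -> Vector:
    """A residual branch that has learned nothing yet (all zeros)."""
    return [0.0 for _ in x]


def small_offset(x: Vector) -> Vector:
    """A hypothetical residual branch that nudges every feature by 10%."""
    return [0.1 * xi for xi in x]


features = [1.0, -2.0, 3.0]
print(residual_block(features, zero_offset))   # the block reduces to the identity
print(residual_block(features, small_offset))  # input plus a small learned offset
```

With a zero residual branch the block passes its input through unchanged, which is why stacking many such modules does not by itself degrade the signal.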

III-C Dense Network

Building upon the success of ResNet [59], DenseNet [60] concatenates the outputs of the convolutional layers within each module, replacing the expensive element-wise addition and retaining both the current features and those from previous layers through skip connections. Furthermore, there is always a path for information from the last layer backward, which deals with the diminishing-gradient problem. Moreover, to improve computational efficiency, DenseNet [60] utilizes 1×1 convolutional layers to reduce the number of input feature maps before each 3×3 convolutional layer. Transition layers are applied to compress the number of channels resulting from the concatenation operations. The building block of DenseNet [60] is shown in Figure 3(c).
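Because concatenation appends feature maps instead of summing them, the channel count grows with every layer of a dense module until a transition layer compresses it. The bookkeeping can be sketched as follows; the initial width of 64, the 32 new feature maps per layer, the six layers, and the compression factor of 0.5 are assumed values chosen for illustration and are not stated in this paper.

```python
from typing import List


def dense_block_input_channels(in_channels: int, num_layers: int, growth_rate: int) -> List[int]:
    """Channels seen by each layer in a dense block, followed by the block's output width."""
    widths = []
    channels = in_channels
    for _ in range(num_layers):
        widths.append(channels)   # layer input = everything concatenated so far
        channels += growth_rate   # the layer appends `growth_rate` new feature maps
    widths.append(channels)       # width of the concatenated block output
    return widths


def transition_output_channels(channels: int, compression: float) -> int:
    """Channels kept by a transition layer that compresses by `compression`."""
    return int(channels * compression)


# Assumed configuration, for illustration only.
widths = dense_block_input_channels(in_channels=64, num_layers=6, growth_rate=32)
print(widths)                                        # [64, 96, 128, 160, 192, 224, 256]
print(transition_output_channels(widths[-1], 0.5))   # 128
```

The steadily widening inputs are what motivate the 1×1 convolutions placed before each 3×3 layer and the transition layers between modules.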

The performance of DenseNet [60] on ILSVRC is comparable with that of ResNet [59], but it has significantly fewer parameters and thus requires less computation; e.g., a DenseNet with 201 convolutional layers and 20 million parameters produces a validation error comparable to that of a ResNet with 101 convolutional layers and 40 million parameters. Therefore, we consider DenseNet [60] a suitable candidate for fine-grained classification.

IV Experimental Settings

CNN             Methods                 Acc.
Fine-grained    MGCNN [27]              81.7%
                STCNN [33]              84.1%
                FCAN [62]               84.3%
                PDFR [17]               84.5%
                RACNN [40]              85.3%
                HIHCA [41]              85.3%
                DTRAM [34]              86.0%
                BilinearCNN [39]        84.1%
                PC-BilinearCNN [46]     85.6%
                PC-DenseNet161 [46]     86.7%
                MACNN [26]              86.5%
                NTSNet [38]             87.5%
                DCL-VGG16 [51]          86.9%
                DCL ResNet50 [51]       87.8%
                TASN [63]               87.9%
Traditional     VGG19 [58]              77.8%
                ResNet50 [59]           84.7%
                ResNet152 [59]          85.0%
                NasNet [64]             83.0%
                DenseNet161 [60]        87.7%
TABLE II: Comparison of state-of-the-art fine-grained and traditional classifiers on the CUB [1] dataset.

IV-A Datasets

In this section, we provide the details of the six most prominent fine-grain datasets used for evaluation and comparison against the current state-of-the-art algorithms.

  • Birds: We compare on two bird datasets. Caltech-UCSD Birds-200-2011, abbreviated as CUB [1], is composed of 11,788 photographs of 200 categories, divided into 5,994 training and 5,794 testing images. The second, North American Birds, generally known as NABirds [2], is the largest dataset in this comparison, with 48,562 photographs of 555 species found in North America.

  • Dogs: Stanford Dogs [3] is a subset of ImageNet [65] gathered for the task of fine-grained categorization. The dataset is composed of 12,000 training and 8,580 testing images.

  • Cars: The Cars dataset [8] has 196 classes that differ in make, model, and year. It contains a total of 16,185 car photographs, split into 8,144 training images and 8,041 testing images, i.e., roughly 50% each.

  • Aeroplanes: The Fine-Grained Visual Classification of Aircraft dataset, i.e., FGVC-Aircraft [7], contains a total of 10,200 images of 102 variants, with 100 images for each. Airplanes are an alternative to the objects usually considered for fine-grained categorization, such as birds and pets.

  • Flowers: The Flowers dataset [4] has 102 classes, with 2,040 training images and 6,149 testing images. Furthermore, there are significant variations within categories as well as similarities with other categories.

Table I summarizes the number of classes and the number of images, including the data split for training, validation (if any), and testing, for the fine-grained visual classification datasets.
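The split sizes in Table I can be tallied to recover the dataset totals and the share of images used for training. The short sketch below copies the counts from Table I (a missing validation split is recorded as 0); the helper name `summarize` is an illustrative assumption.

```python
from typing import Dict, Tuple

# (train, val, test) image counts copied from Table I; 0 means no validation split.
table_i: Dict[str, Tuple[int, int, int]] = {
    "NABirds": (23929, 0, 24633),
    "Dogs": (12000, 0, 8580),
    "CUB": (5994, 0, 5794),
    "Aircraft": (3334, 3333, 3333),
    "Cars": (8144, 0, 8041),
    "Flowers": (2040, 0, 6149),
}


def summarize(splits: Tuple[int, int, int]) -> Tuple[int, float]:
    """Return the total number of images and the fraction used for training."""
    train, val, test = splits
    total = train + val + test
    return total, train / total


for name, splits in table_i.items():
    total, train_fraction = summarize(splits)
    print(f"{name:9s} total={total:6d} train fraction={train_fraction:.2f}")
```

The totals for CUB, NABirds, and Cars reproduce the 11,788, 48,562, and 16,185 photographs quoted in the dataset descriptions.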

V Evaluations

V-A Performance on CUB Dataset

We present the comparisons on the CUB dataset [1] in Table II. The best performer among the traditional classifiers is DenseNet [60], which is unsurprising because the model concatenates the feature maps from preceding layers to preserve details. Apart from VGG19 [58], the weakest of the traditional classifiers is NasNet [64], possibly because its design is more inclined towards a specific dataset (i.e., ImageNet [65]). The ResNet models [59] perform better than NasNet [64], which suggests that networks with shortcut connections surpass those with multi-scale representations for fine-grained classification. DenseNet [60] offers higher accuracy than ResNet [59] because the former does not fuse the features and carries the details forward, unlike the latter, where the features are combined in each block.

CNN             Methods                 Aircraft [7]    Cars [8]
Fine-grained    FVCNN [66]              81.5%           -
                FCAN [62]               -               89.1%
                BilinearCNN [39]        84.1%           91.3%
                RACNN [40]              88.2%           92.5%
                HIHCA [41]              88.3%           91.7%
                PC-Bilinear [46]        85.8%           92.5%
                PC-ResNet50 [46]        83.4%           93.4%
                PC-DenseNet161 [46]     89.2%           92.7%
                MACNN [26]              89.9%           92.8%
                DTRAM [34]              -               93.1%
                TASN [63]               -               93.8%
                NTSNet [38]             91.4%           93.9%
Traditional     VGG19 [58]              85.7%           80.5%
                ResNet50 [59]           91.4%           91.7%
                ResNet152 [59]          90.7%           93.2%
                NasNet [64]             88.5%           -
                DenseNet161 [60]        92.9%           94.5%
TABLE III: Experimental results on FGVC Aircraft [7] and Cars [8].

The fine-grained classification literature considers CUB-200-2011 [1] a standard benchmark for evaluation; therefore, image-level labels, bounding boxes, and different types of annotations are employed to extract the best results on this dataset. Similarly, multi-branch networks focusing on various parts of images and multiple objective functions are combined for optimization. In contrast, the traditional classifiers [60, 59] use a single loss without any extra information or annotations. The best-performing fine-grained classifiers for CUB [1] are DCL ResNet50 [51], TASN [63], and NTSNet [38]: DCL ResNet50 [51] and TASN [63] gain 0.1% and 0.2% over DenseNet [60], respectively, while NTSNet [38] lags by a margin of 0.2%. The improvement over DenseNet [60] is marginal, keeping in mind the different computationally expensive tactics that fine-grained classifiers employ to learn distinguishable features.

CNN             Methods                 Dogs [3]    Flowers [4]    NABirds [2]
Fine-grained    Zhang et al. [67]       80.4%       -              -
                Krause et al. [68]      80.6%       -              -
                Det.+Seg. [69]          -           80.7%          -
                Overfeat [50]           -           86.8%          -
                Branson et al. [47]     -           -              35.7%
                Van et al. [2]          -           -              75.0%
                BilinearCNN [39]        82.1%       92.5%          80.9%
                PC-ResNet50 [46]        73.4%       93.5%          68.2%
                PC-BilinearCNN [46]     83.0%       93.7%          82.0%
                PC-DenseNet161 [46]     83.8%       91.4%          82.8%
Traditional     VGG19 [58]              76.7%       88.73%         74.9%
                ResNet50 [59]           83.4%       97.2%          79.6%
                ResNet152 [59]          85.2%       97.5%          84.0%
                DenseNet161 [60]        85.2%       98.1%          86.3%
TABLE IV: Comparison of state-of-the-art fine-grained and traditional classifiers on the Dogs [3], Flowers [4], and NABirds [2] datasets.

V-B Quantitative analysis on Aircraft and Cars

In Table III, the performance of fine-grained classifiers is shown on the Cars [8] and Aircraft [7] datasets. Here, we observe that the traditional classifiers perform better than the fine-grained classifiers. DenseNet161 [60] improves by 1.5% and 3.0% on Aircraft [7] over the best-performing fine-grained classifiers, NTSNet [38] and MACNN [26], respectively. Similarly, an improvement of 0.6% and 1.4% is recorded against NTSNet [38] and DTRAM [34] on Cars [8], respectively. The fine-grained classifiers [38, 34, 63] fail to achieve the same accuracy as the traditional classifiers, although the former employ more image-specific information for learning.
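The margins quoted for Aircraft and Cars follow directly from the entries of Table III. As a quick check, the sketch below copies the relevant accuracies from that table and subtracts each fine-grained result from DenseNet161; the dictionary layout and the helper name `margin_over` are illustrative assumptions.

```python
from typing import Dict

# Top-1 accuracies (in %) copied from Table III.
table_iii: Dict[str, Dict[str, float]] = {
    "Aircraft": {"NTSNet": 91.4, "MACNN": 89.9, "DenseNet161": 92.9},
    "Cars": {"NTSNet": 93.9, "DTRAM": 93.1, "TASN": 93.8, "DenseNet161": 94.5},
}


def margin_over(dataset: str, baseline: str, reference: str = "DenseNet161") -> float:
    """Accuracy gap (percentage points) of `reference` over `baseline`, to one decimal."""
    scores = table_iii[dataset]
    return round(scores[reference] - scores[baseline], 1)


for dataset, scores in table_iii.items():
    for method in scores:
        if method != "DenseNet161":
            print(f"{dataset:8s} DenseNet161 - {method:6s} = {margin_over(dataset, method):.1f}")
```

The differences are expressed in percentage points of top-1 accuracy, not as relative improvements.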

V-C Comparison on Stanford Dogs

The Stanford Dogs dataset [3] is another challenging dataset, for which the performance is compared in Table IV. Here, we utilize ResNet [59] and DenseNet [60] from the traditional classifiers. The performance of ResNet [59] with 152 layers matches that of DenseNet [60] with 161 layers: both achieve 85.2% accuracy, which is 1.4% higher than PC-DenseNet161 [46], the best-performing fine-grained classifier. This experiment suggests that incorporating traditional classifiers into fine-grained ones requires more insight than simply plugging them into the framework. It is also worth mentioning that some of the fine-grained classifiers employ a large amount of data from other sources in addition to the Stanford Dogs training data.

V-D Results of Flower dataset

The accuracy of DenseNet [60] on the Flowers dataset [4] is 98.1%, which is 4.4% higher than PC-BilinearCNN [46] (93.7%), the best-performing fine-grained method in Table IV, and 4.6% higher than PC-ResNet50 [46] (93.5%). Similarly, the other residual and dense traditional classifiers also outperform the fine-grained ones by a significant margin. It should also be noted that the performance on this dataset is approaching saturation.

V-E Performance on NABirds

Relatively few methods have reported results on this dataset. However, for the sake of completeness, we provide comparisons on the NABirds [2] dataset. Again, the leading performance on NABirds [2] is achieved by DenseNet161 [60], followed by ResNet152 [59]. The third-best performer is a fine-grained classifier, PC-DenseNet161 [46], which internally employs DenseNet161 [60] yet lags behind it by 3.5%. This shows the superior performance of the traditional CNN classifiers against state-of-the-art fine-grained CNN classifiers.

V-F Ablation studies

Here, we compare two strategies for training traditional CNN classification networks on the Cars dataset: fine-tuning weights pre-trained on ImageNet [65] and training from scratch (randomly initializing the weights). The accuracy of each is given in Table V. ResNet50 [59] achieves higher accuracy when fine-tuned than when randomly initialized. Similarly, ResNet152 [59] performs better when fine-tuned; however, it fails when trained from scratch, reaching only 36.9%. The reason may be its large number of parameters combined with the small amount of training data.

Initial            Methods
Weights      ResNet50    ResNet152
Scratch      83.4%       36.9%
Fine-tune    91.7%       93.2%
TABLE V: Different strategies for initializing the network weights, i.e., fine-tuning from ImageNet and random initialization (scratch), for the Cars [8] dataset.

VI Conclusion

In this paper, we provided comparisons between state-of-the-art traditional CNN classifiers and fine-grained CNN classifiers. It has been shown that traditional models achieve state-of-the-art performance on fine-grained classification datasets and, on most of them, outperform the fine-grained CNN classifiers. Therefore, it is necessary to update the baselines for comparisons. It is also important to note that the performance increase is due to the initial weights trained on the ImageNet [65] dataset. Furthermore, we have established that the DenseNet161 model achieves state-of-the-art results on five of the six datasets, outperforming the fine-grained classifiers by a clear margin, and comes within 0.2% of the best fine-grained classifier on CUB.


  • [1] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” California Institute of Technology, Tech. Rep., 2011.
  • [2] G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie, “Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection,” in CVPR, 2015.
  • [3] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li, “Novel dataset for fgvc: Stanford dogs,” in CVPR Workshop, 2011.
  • [4] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in ICVGIP, 2008.
  • [5] S. Hou, Y. Feng, and Z. Wang, “Vegfru: A domain-specific dataset for fine-grained visual categorization,” in ICCV, 2017.
  • [6] J. D. Wegner, S. Branson, D. Hall, K. Schindler, and P. Perona, “Cataloging public objects using aerial and street-level images-urban trees,” in CVPR, 2016.
  • [7] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv, 2013.
  • [8] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in CVPR Workshops, 2013.
  • [9] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang, “Pfid: Pittsburgh fast-food image dataset,” in ICIP, 2009.
  • [10] N. Aafaq, A. Mian, W. Liu, S. Z. Gilani, and M. Shah, “Video description: A survey of methods, datasets, and evaluation metrics,” ACM Computing Surveys (CSUR), vol. 52, no. 6, p. 115, 2019.
  • [11] G. C. Spivak, Outside in the teaching machine.   Routledge, 2012.
  • [12] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in CVPR, 2018.
  • [13] A. Angelova, S. Zhu, and Y. Lin, “Image segmentation for large-scale subcategory flower recognition,” in WACV, 2013.
  • [14] Y. Chai, E. Rahtu, V. Lempitsky, L. Van Gool, and A. Zisserman, “Tricos: A tri-level class-discriminative co-segmentation method for image classification,” in ECCV, 2012.
  • [15] J. Deng, J. Krause, and L. Fei-Fei, “Fine-grained crowdsourcing for fine-grained recognition,” in CVPR, 2013.
  • [16] E. Gavves, B. Fernando, C. G. Snoek, A. W. Smeulders, and T. Tuytelaars, “Fine-grained categorization by alignments,” in ICCV, 2013.
  • [17] X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian, “Picking deep filter responses for fine-grained image recognition,” in CVPR, 2016.
  • [18] S. Huang, Z. Xu, D. Tao, and Y. Zhang, “Part-stacked cnn for fine-grained visual categorization,” in CVPR, 2016.
  • [19] X. Zhang, F. Zhou, Y. Lin, and S. Zhang, “Embedding label structures for fine-grained feature representation,” in CVPR, 2016.
  • [20] S. Yang, L. Bo, J. Wang, and L. G. Shapiro, “Unsupervised template learning for fine-grained object recognition,” in NIPS, 2012.
  • [21] Y. Chai, V. Lempitsky, and A. Zisserman, “Symbiotic segmentation and part localization for fine-grained categorization,” in ICCV, 2013.
  • [22] T. Berg and P. Belhumeur, “Poof: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation,” in CVPR, 2013.
  • [23] N. Zhang, R. Farrell, F. Iandola, and T. Darrell, “Deformable part descriptors for fine-grained recognition and attribute prediction,” in ICCV, 2013.
  • [24] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns for fine-grained category detection,” in ECCV, 2014.
  • [25] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, “The application of two-level attention models in deep convolutional neural network for fine-grained image classification,” in CVPR, 2015.
  • [26] H. Zheng, J. Fu, T. Mei, and J. Luo, “Learning multi-attention convolutional neural network for fine-grained image recognition,” in ICCV, 2017.
  • [27] D. Wang, Z. Shen, J. Shao, W. Zhang, X. Xue, and Z. Zhang, “Multiple granularity descriptors for fine-grained categorization,” in ICCV, 2015.
  • [28] O. M. Parkhi, A. Vedaldi, C. Jawahar, and A. Zisserman, “The truth about cats and dogs,” in ICCV, 2011.
  • [29] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in CVPR, 2012.
  • [30] J. Liu, A. Kanazawa, D. Jacobs, and P. Belhumeur, “Dog breed classification using part localization,” in ECCV, 2012.
  • [31] L. Xie, Q. Tian, R. Hong, S. Yan, and B. Zhang, “Hierarchical part matching for fine-grained visual categorization,” in ICCV, 2013.
  • [32] J. Krause, H. Jin, J. Yang, and L. Fei-Fei, “Fine-grained recognition without part annotations,” in CVPR, 2015.
  • [33] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in NIPS, 2015.
  • [34] Z. Li, Y. Yang, X. Liu, F. Zhou, S. Wen, and W. Xu, “Dynamic computational time for visual attention,” in ICCV, 2017.
  • [35] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in CVPR, 2008.
  • [36] M. Simon and E. Rodner, “Neural activation constellations: Unsupervised part model discovery with convolutional networks,” in ICCV, 2015.
  • [37] M. Simon, E. Rodner, and J. Denzler, “Part detector discovery in deep convolutional neural networks,” in ACCV, 2014.
  • [38] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, and L. Wang, “Learning to navigate for fine-grained classification,” in ECCV, 2018.
  • [39] T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” in ICCV, 2015.
  • [40] J. Fu, H. Zheng, and T. Mei, “Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition,” in CVPR, 2017.
  • [41] S. Cai, W. Zuo, and L. Zhang, “Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization,” in ICCV, 2017.
  • [42] Q. Qian, R. Jin, S. Zhu, and Y. Lin, “Fine-grained visual categorization via multi-stage metric learning,” in CVPR, 2015.
  • [43] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” in ICML, 2014.
  • [44] Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie, “Large scale fine-grained categorization and domain-specific transfer learning,” in CVPR, 2018.
  • [45] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a ‘siamese’ time delay neural network,” in NIPS, 1994.
  • [46] A. Dubey, O. Gupta, P. Guo, R. Raskar, R. Farrell, and N. Naik, “Pairwise confusion for fine-grained visual classification,” in ECCV, 2018.
  • [47] S. Branson, G. Van Horn, S. Belongie, and P. Perona, “Bird species categorization using pose normalized deep convolutional nets,” arXiv, 2014.
  • [48] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
  • [49] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in ECCV, 2014.
  • [50] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn features off-the-shelf: an astounding baseline for recognition,” in CVPR workshops, 2014.
  • [51] Y. Chen, Y. Bai, W. Zhang, and T. Mei, “Destruction and construction learning for fine-grained image recognition,” in CVPR, 2019.
  • [52] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multi-scale orderless pooling of deep convolutional activation features,” in ECCV, 2014.
  • [53] M. Cimpoi, S. Maji, and A. Vedaldi, “Deep filter banks for texture recognition and segmentation,” in CVPR, 2015.
  • [54] A. Babenko and V. Lempitsky, “Aggregating local deep features for image retrieval,” in ICCV, 2015.
  • [55] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
  • [56] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” in CVPR, 2015.
  • [57] S. Xie and Z. Tu, “Holistically-nested edge detection,” in ICCV, 2015.
  • [58] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [59] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [60] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, 2017.
  • [61] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [62] X. Liu, T. Xia, J. Wang, Y. Yang, F. Zhou, and Y. Lin, “Fully convolutional attention networks for fine-grained recognition,” arXiv, 2016.
  • [63] H. Zheng, J. Fu, Z.-J. Zha, and J. Luo, “Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition,” in CVPR, 2019.
  • [64] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in CVPR, 2018.
  • [65] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
  • [66] P.-H. Gosselin, N. Murray, H. Jégou, and F. Perronnin, “Revisiting the fisher vector for fine-grained classification,” PRL, 2014.
  • [67] Y. Zhang, X.-S. Wei, J. Wu, J. Cai, J. Lu, V.-A. Nguyen, and M. N. Do, “Weakly supervised fine-grained categorization with part-based image representation,” TIP, 2016.
  • [68] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei, “The unreasonable effectiveness of noisy data for fine-grained recognition,” in ECCV, 2016.
  • [69] A. Angelova and S. Zhu, “Efficient object detection and segmentation for fine-grained recognition,” in CVPR, 2013.