), which has attracted increasing attention in computer vision and pattern recognition. Compared with general object classification, this task is extremely challenging due to the large variance within the same subcategory and the small variance among different subcategories. Since these subcategories are similar in global appearance, they can only be distinguished by subtle visual differences in the local regions of key parts, such as the shape of the beak, the color of the feet and the texture of the feathers for birds. Thus, localizing the object and its discriminative parts is essential for fine-grained image classification.
Encouragingly, a majority of fine-grained image classification methods have incorporated part localization and achieved significant progress. However, most earlier works [3, 11, 36, 39, 41] still rely on strong supervision in the form of human-labeled object annotations (i.e., the bounding box of the object) or part annotations (i.e., part locations). Since object and part annotations are laborious and expensive to obtain, many works [37, 43, 42, 20, 44, 7, 45] address part localization under a weakly-supervised setting with only image-level labels. These methods can be roughly divided into two groups: two-stage methods, which perform part localization and fine-grained classification separately, and end-to-end methods, which jointly learn discriminative part localization and fine-grained feature representation. Most two-stage methods [43, 37, 42, 40] use region proposals as candidate regions to localize the discriminative parts, which may lead to low accuracy and high time consumption. Recently, [7, 45] propose end-to-end frameworks in which part localization and feature learning mutually reinforce each other. Although promising results have been reported, these models are difficult to train because of their sophisticated alternating training procedures.
To deal with the above problems, we propose an unsupervised part mining (UPM) approach for fine-grained image classification. Our part localization method is fully unsupervised: it requires no annotations, not even image-level labels. The key idea of UPM is to explore the distinctive parts from the pattern mining perspective. To realize this idea, we reuse a pre-trained CNN model, which has powerful representation abilities, and employ pattern mining techniques to mine frequently occurring visual patterns from a large number of CNN activations. The mined patterns correspond closely to the possible parts, which can be exploited to boost the classification performance. Our UPM approach is simple but effective, and it does not require a complex or lengthy training process. Because part localization has no dependency on any annotations, including image-level labels, the usability and scalability of fine-grained classification are greatly increased.
Our approach consists of a part localization module and a part-based classification module, as shown in Figure 1. In the part localization module, we reuse a pre-trained CNN model and employ pattern mining techniques to localize the possible parts without using any annotations. Specifically, we first convert the deep features from multiple convolutional layers of a pre-trained CNN model (e.g., VGG-16) into a set of transactions, and then discover the co-occurrence patterns through pattern mining techniques. We observe that the relevant patterns generally correspond to representative local regions in one image. Motivated by this observation, we utilize a simple clustering algorithm (e.g., k-means) to cluster the mined patterns with frequency information into multiple clusters. Finally, the regions surrounding the cluster centers are taken as the key parts of a given image and can be further used for fine-grained image classification. In the part-based classification module, these localized parts are further clustered based on deep features and fed into a deep classification network, in which a multi-stream architecture is built to aggregate features of different levels for subsequent fine-grained classification. Our main contributions can be summarized as follows:
We present a novel and effective unsupervised part localization approach that requires no annotations, not even image-level labels. Part localization is a key issue for fine-grained image classification, and the experimental results show that the localized parts contribute to the final classification accuracy.
To the best of our knowledge, this is the first successful use of pattern mining for fine-grained image classification; it fully exploits the information in the convolutional activations of a pre-trained CNN model.
We conduct comprehensive experiments on three challenging fine-grained datasets (Caltech-UCSD Birds, Stanford Cars and FGVC-Aircraft), and achieve competitive performance compared with the state-of-the-art methods.
2 Related Work
2.1 Fine-grained Image Classification
Fine-grained image classification is a fundamental and important task in computer vision, and a large number of works have been developed in the past few years. Benefiting from advances in deep learning, many works [17, 29, 41, 37, 19] learn more discriminative feature representations by leveraging deep CNNs and achieve significant progress.
Since subtle visual differences mostly reside in the local regions of parts, discriminative part localization is crucial for fine-grained image classification, and numerous works follow this direction. [41, 11, 39, 36] learn accurate part localization models with manual object bounding boxes and part annotations. Considering that such annotations are laborious and expensive, some works [43, 42, 10, 37, 44, 20, 7, 45] focus on how to exploit parts under a weakly-supervised setting with only image-level labels.  proposes an automatic fine-grained classification method, incorporating deep convolutional filters with significant and consistent responses for both part selection and representation. Some of the above part localization-based methods [43, 37, 42, 40] first need to produce object or part candidates by selective search, which poses challenges to accurate part localization.
Additionally, some weakly-supervised methods [25, 37, 44, 20, 7, 45] use visual attention mechanisms to automatically capture the informative regions.  employs a fully convolutional attention network to adaptively localize multiple parts simultaneously. Recent works [7, 45] propose end-to-end frameworks in which part localization and feature learning mutually reinforce each other. Although promising results have been reported, these models are difficult to train because of their sophisticated alternating training procedures.
Compared with previous efforts, our UPM approach can accurately localize the parts in a fully unsupervised way, without even image-level labels, and thus does not need sophisticated training procedures. Moreover, it does not rely on a large number of region proposals. It is worth noting that NAC also considers part localization in a fully unsupervised manner without image-level annotations, which is similar to our work. However, our proposed method directly localizes multiple fine-grained parts instead of selecting useful ones from part proposals, and outperforms NAC by a large margin.
2.2 Pattern Mining in Computer Vision
Pattern mining is one of the most intensively investigated problems in the data mining domain. Generally, a pattern is a combination of several elements that captures distinctive information. Inspired by this fact, more and more researchers investigate how to employ pattern mining to address computer vision tasks, including image classification [6, 18], image collection summarization and object retrieval.
A key issue of pattern mining is how to transform an image into transactions, which could retain the discriminative information as much as possible and also guarantee that those transactions should be suitable for pattern mining. Earlier works [23, 1] simply treat an individual visual word as an item in a transaction by adopting local bag-of-words as image representation.  proposes a frequent local histograms method to represent an image with the histograms of patterns sets. Recently,  is a pioneering work to illustrate how pattern mining techniques are combined with the CNN features. In , a local patch is transformed into a transaction by treating each dimension index of a CNN activation from fully-connected layer as an item.
In this section, we present the approach overview as shown in Figure 1. The approach is composed of an unsupervised part localization module (Section 3.2) and a part-based classification module (Section 3.3). In the first module, we aim to obtain the location of parts. The innovation of our approach is to localize discriminative parts by employing pattern mining techniques on the feature maps of a pre-trained CNN model. In the second module, we rely on the part locations to learn a joint feature representation and conduct part-based classification.
The following notations and terminology of data mining are used in the rest of this paper. Let $I = \{i_1, i_2, \dots, i_M\}$ denote an itemset containing M items. A transaction T is a subset of I that satisfies $|T| \le M$, where $|T|$ is the number of items in T. A transaction database is defined as $\mathcal{D} = \{T_1, T_2, \dots, T_N\}$, where $T_n \subseteq I$. Given an itemset $P \subseteq I$, we define the support value of P as:
$$\mathrm{supp}(P) = \frac{|\{T \mid T \in \mathcal{D},\ P \subseteq T\}|}{N},$$
where $|\cdot|$ measures the cardinality. The support value of pattern P indicates the fraction of transactions in $\mathcal{D}$ that contain P, i.e., $\mathrm{supp}(P) \in [0, 1]$. P is regarded as a frequent itemset when its support value is larger than a predefined threshold.
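As a small illustration of the support definition, the following sketch computes the support value of an itemset as the fraction of transactions that contain it. The function name `support` and the toy database are illustrative assumptions, not part of the original method description.

```python
from typing import FrozenSet, Iterable, List


def support(pattern: Iterable[int], transactions: List[FrozenSet[int]]) -> float:
    """Fraction of transactions that contain every item of `pattern`."""
    p = frozenset(pattern)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if p <= t)  # p <= t tests "p is a subset of t"
    return hits / len(transactions)


# Toy database of four transactions (illustrative values only).
database = [frozenset(s) for s in ({0, 1, 2}, {1, 2}, {1, 2, 4}, {0, 3})]
print(support({1, 2}, database))  # 3 of the 4 transactions contain {1, 2} -> 0.75
```

With a threshold of, say, 0.5, the itemset {1, 2} would count as frequent in this toy database, whereas {0, 3} (support 0.25) would not.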
3.2 Unsupervised Part Localization
The goal of part localization is to obtain a collection of discriminative parts for a given fine-grained image. High-level convolutional layers can learn semantic cues, i.e., meaningful patterns, which correspond to whole objects or parts of objects. Inspired by this observation, we propose a fully unsupervised part mining approach in which the parts are discovered directly from the activations of a pre-trained CNN model through pattern mining techniques, without any labels. Note that the pre-trained model is not fine-tuned on the target fine-grained dataset.
Figure 2 illustrates the pipeline of our UPM approach. We first extract feature maps from pool5 and relu5 layers of a pre-trained VGG-16  model, and then adopt pattern mining techniques to discover frequent patterns in these feature maps. Finally we perform the clustering algorithm on mined patterns and generate the parts surrounding the corresponding cluster centers.
3.2.1 Transaction Creation
In order to apply pattern mining techniques to part localization task, the process of transforming the image into a set of transactions while retaining useful information is a key issue that must be tackled.
Given an input image I, we first feed it into a pre-trained VGG-16 model and extract feature maps from the pool5 and relu5 layers, as shown in Figure 2 (c). We observe that most semantic parts of a bird fire frequently at the same locations in the feature maps. Moreover, the activations of the two layers complement each other very well. Thus, we adopt a multi-layer combination strategy to alleviate the loss of useful information caused by considering the activations of a single layer only. Besides, we resize the feature maps of both layers to the same size by bilinear interpolation, and we obtain 1,024 feature maps in total.
The dimension of each feature map is $w \times h$, where $w$ and $h$ indicate the width and height of the feature map respectively. To simplify the process of creating transactions, we stretch each feature map into a vector. In our UPM approach, each feature map is taken as a transaction, and each position index activated in the feature map is considered as an item. For example, if five positions are activated in a feature map, the corresponding transaction contains five items. The set of all transactions forms the transaction database, and the index set of all positions activated in the feature maps is the itemset.
Next, we select the meaningful descriptors in Figure 2 (d) and convert them into items. Specifically, instead of using a fixed threshold, we use a tunable threshold computed as the mean value of the CNN activation responses larger than 0. A position whose response value is higher than this threshold is highlighted, and its index is converted into an item. The indexes of all highlighted positions in one feature map finally form a transaction, as shown in Figure 2 (e).
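The conversion of one feature map into a transaction can be sketched as follows. This is a minimal sketch assuming the feature map is given as a nested list of floats; the function name `feature_map_to_transaction` is hypothetical, and positions are indexed in row-major order after stretching the map into a vector.

```python
from typing import List, Set


def feature_map_to_transaction(fmap: List[List[float]]) -> Set[int]:
    """Turn one 2-D feature map into the set of its activated position indices."""
    flat = [v for row in fmap for v in row]        # stretch the map into a vector
    positive = [v for v in flat if v > 0]
    if not positive:                               # nothing fired: empty transaction
        return set()
    threshold = sum(positive) / len(positive)      # mean of the responses larger than 0
    return {idx for idx, v in enumerate(flat) if v > threshold}


fmap = [[0.0, 0.2, 0.0],
        [0.1, 0.9, 0.8],
        [0.0, 0.0, 0.4]]
# The mean of the positive responses is 0.48, so only 0.9 and 0.8 are highlighted.
print(feature_map_to_transaction(fmap))  # {4, 5}
```

Applying this function to every resized feature map yields one transaction per map, i.e., the transaction database used for mining.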
3.2.2 Pattern Mining
Once the set of transactions in Figure 2 (e) is created, we utilize the Apriori algorithm to discover frequent itemsets (i.e., patterns). For a given minimum support threshold $supp_{\min}$, an itemset P whose support value is larger than $supp_{\min}$ is considered as a pattern, as shown in Figure 2 (f). Note that the support value of a pattern indicates how frequently this pattern appears across all feature maps. Thus, an appropriate value of $supp_{\min}$ guarantees that we can mine the most representative and discriminative patterns.
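To make the mining step concrete, the following is a compact, unoptimized sketch of level-wise Apriori mining in pure Python. It is an illustration under stated assumptions rather than the implementation used in the experiments: the function name `apriori` is hypothetical, and an itemset is kept when its support is strictly larger than `min_support`.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def apriori(transactions: List[FrozenSet[int]], min_support: float) -> Dict[FrozenSet[int], float]:
    """Return every itemset whose support is strictly larger than `min_support`."""
    n = len(transactions)

    def supp(itemset: FrozenSet[int]) -> float:
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1 candidates: all single items that occur anywhere.
    current = {frozenset([i]) for t in transactions for i in t}
    frequent: Dict[FrozenSet[int], float] = {}
    k = 1
    while current:
        # Keep only the candidates of size k that are frequent.
        level = {c: s for c, s in ((c, supp(c)) for c in current) if s > min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of two frequent itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


transactions = [frozenset(t) for t in ({1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3})]
patterns = apriori(transactions, min_support=0.5)
for itemset, value in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

In this toy example the itemset {1, 3} has support exactly 0.5 and is therefore rejected, which also prunes {1, 2, 3} before its support is ever counted.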
3.2.3 Part Mining
Based on these mined patterns, we first select the largest connected component to remove the isolated patterns that indicate background regions, and merge the remaining patterns to generate the support map. Subsequently, we apply a clustering algorithm to the support map to localize multiple parts simultaneously. Finally, we adopt a simple and effective geometric constraint to crop a square surrounding each cluster center as a part region. Next, we present the details of part mining.
Generating support map. In our UPM approach, a mined pattern corresponds to a region in one image as shown in Figure 2 (f) and some relevant patterns generally indicate prominent representative local regions (e.g., the head of bird). Besides, we find that the isolated regions represented by one pattern or multiple patterns usually belong to the background of an image. Thus, we select the largest connected component based on all mined patterns to remove those isolated patterns.
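The removal of isolated patterns can be sketched as a largest-connected-component search on a binary grid in which a cell is 1 when its position is covered by a mined pattern. The sketch below assumes 4-connectivity, which the text does not specify, and the function name `largest_component` is hypothetical.

```python
from collections import deque
from typing import List, Set, Tuple


def largest_component(mask: List[List[int]]) -> Set[Tuple[int, int]]:
    """Return the (row, col) cells of the largest 4-connected region of non-zero cells."""
    h = len(mask)
    w = len(mask[0]) if h else 0
    seen: Set[Tuple[int, int]] = set()
    best: Set[Tuple[int, int]] = set()
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or (r, c) in seen:
                continue
            # Breadth-first search over the region that contains (r, c).
            comp = {(r, c)}
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        comp.add((ny, nx))
                        queue.append((ny, nx))
            if len(comp) > len(best):
                best = comp
    return best


pattern_mask = [[1, 1, 0, 0],
                [0, 1, 0, 1],
                [0, 0, 0, 0],
                [1, 0, 0, 0]]
print(sorted(largest_component(pattern_mask)))  # [(0, 0), (0, 1), (1, 1)]
```

The two single-cell regions at (1, 3) and (3, 0) are discarded, mirroring the treatment of isolated patterns as background.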
Here we introduce a new concept, support map, whose size is same with the feature map of relu5 layer. Note that the support map in Figure 2 (g) is generated by merging relevant and non-redundant patterns. Suppose that we have mined patterns denoted as , the support map is defined as:
where denotes the frequency of an item represented by its position . To obtain the support map with the same size as the original image, we upsample the support map by bilinear interpolation. The support map indicates how many times each item would be activated from all feature maps. More importantly, the higher value of the position, the more likely its corresponding region could be a part of the object.
Finding part regions by clustering. Inspired by the observation that some relevant patterns generally correspond to representative local regions (e.g., the head of bird) and the local regions are spatially continuous, thus we can divide the regions into several groups of spatial locations. An intuitive idea is to perform the clustering algorithm on the support map. Specifically, we first produce the clustering data, which are three-dimensional data including the coordinates of each spatial location and its corresponding support map value . Then we take them as input of the k-means algorithm to cluster these connected regions into clusters, as shown in Figure 2 (h). Surprisingly, the local regions represented by the patterns belonging to one cluster can be regarded as a discriminative part for a fine-grained image. Therefore, we obtain part locations in the original image, where denotes the coordinates of the part.
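The clustering step can be illustrated with a plain Lloyd-style k-means on three-dimensional samples (x, y, support value). This is a sketch under stated assumptions: the function name `kmeans`, the deterministic seeding from evenly spaced samples, and the toy data are all illustrative, and no rescaling between coordinates and support values is applied.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], k: int, iters: int = 50) -> Tuple[List[Point], List[int]]:
    """Plain Lloyd's k-means; centres are seeded from evenly spaced input points."""
    step = max(1, len(points) // k)
    centres = [tuple(points[min(i * step, len(points) - 1)]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = [min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centres[j])))
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centres.append(centres[j])  # keep the centre of an empty cluster
        if new_centres == centres:
            break
        centres = new_centres
    return centres, labels


# Each sample is (x, y, support value) for one location of the support map.
samples = [(0, 0, 3), (1, 0, 3), (0, 1, 2), (10, 10, 5), (11, 10, 4), (10, 11, 5)]
centres, labels = kmeans(samples, k=2)
part_locations = [(cx, cy) for cx, cy, _ in centres]  # drop the support dimension
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The first two coordinates of each cluster centre then serve as a part location in the image.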
After obtaining the part locations, the parts are generated by cropping squares from I, with each element of C as the square center. However, if the side length of the square is simply fixed, some cropped parts may contain only a small part region together with a large amount of background noise. In addition, fixed-size parts may overlap heavily with each other. Therefore, to generate more representative and distinctive parts, we adopt a simple and effective geometric constraint to determine the side length of a part as follows:
where and are width and height of the bounding box generated from the support map respectively, and is a scale factor. Finally, we can define the part region mask as:
Thus, the cropped part region can be computed as:
where denotes element-wise multiplication. Each part region is amplified into for subsequent part-based classification.
Algorithm 1 gives the details of part mining.
3.3 Part-based Classification
The different level focuses (i.e., image-level, object-level and part-level) have different representations and are complementary to improve the classification performance. Therefore, we build a multi-stream architecture with an Image stream, an Object stream and a Part stream to learn a joint feature representation, as shown in Figure 1. Since previous works [41, 20, 7, 45] indicate the benefits of region zooming, we amplify the original image to a higher resolution . These images are taken as input to train a classification network based on the original image.
Object stream. Object localization can eliminate the influence of noisy background so that representative object features are learned. Thus, we also consider object localization in our method. We observe that the support map in Section 3.2 indicates the representative object regions, as shown in Figure 2 (g), so it is reasonable to generate the object region from the support map. Specifically, we perform binarization and connected-region extraction on the support map, which is similar to CAM. Finally, the images are cropped and resized to a fixed size to train a classification model based on the object-level images.
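The object localization step can be sketched as binarization followed by extraction of the largest connected region and its bounding box. The binarization threshold `thresh`, the use of 4-connectivity and the function name `object_box` are assumptions made for illustration; the text does not state these details.

```python
from typing import List, Optional, Tuple


def object_box(support_map: List[List[float]], thresh: float) -> Optional[Tuple[int, int, int, int]]:
    """Box (x_min, y_min, x_max, y_max) of the largest 4-connected region above `thresh`."""
    h = len(support_map)
    w = len(support_map[0]) if h else 0
    binary = [[v > thresh for v in row] for row in support_map]  # binarization
    seen = [[False] * w for _ in range(h)]
    best: List[Tuple[int, int]] = []
    for r in range(h):
        for c in range(w):
            if not binary[r][c] or seen[r][c]:
                continue
            stack, comp = [(r, c)], []
            seen[r][c] = True
            while stack:  # flood fill of one connected region
                y, x = stack.pop()
                comp.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(comp) > len(best):
                best = comp
    if not best:
        return None
    ys = [y for y, _ in best]
    xs = [x for _, x in best]
    return min(xs), min(ys), max(xs), max(ys)


smap = [[0, 0, 0, 0, 0],
        [0, 3, 4, 0, 0],
        [0, 2, 5, 0, 1],
        [0, 0, 0, 0, 0]]
print(object_box(smap, thresh=0.5))  # (1, 1, 2, 2)
```

The isolated response at the right border is ignored because it is not part of the largest region, and the returned box would then be used to crop the object-level image.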
Part stream. Since the parts can capture the subtle and local discrimination within two similar subcategories, we train a set of classification models based on part-level images, each of which conducts classification on one part separately.
For each image in the training set, parts are obtained by our UPM approach. However, these parts are unordered and not aligned by their semantic meaning. Therefore, we need to group the parts with the same semantic meaning together, so as to provide the training data for the multiple part-level models. We are inspired by the fact that different convolutional layers learn features of different levels. Generally speaking, higher convolutional layers carry more discriminative power and are thus more likely to learn semantic cues (meaningful patterns, e.g., a bird's head or a dog's face). An intuitive idea is to utilize clustering techniques to obtain the part clusters in the convolutional feature space.
For clear expression, we denote the part mask as , which represents the part mask of the training image. Specifically, we first extract convolutional features by feeding the original image I into a classification model trained on interest dataset (e.g., conv5_4 layer of VGG-19 ). The extracted deep features are denoted as , where represents a set of operations of convolution, pooling and activation, and W represents the overall parameters of the model. Then we resize the part mask in Section 3.2.3 to the same size of . The features corresponding to the part region of the training image can be represented as:
To reduce the dimension of features, global average pooling (GAP) is performed on the above features. Finally, we obtain the feature descriptors of all parts and perform the spectral clustering algorithm on them to partition the corresponding parts into groups. Each part-level CNN model is fine-tuned on its corresponding parts separately.
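The part descriptor used for clustering can be sketched as follows, assuming that the part features are the element-wise product of each feature channel with the resized binary part mask, followed by GAP over all spatial positions. Both the function name `masked_gap` and the nested-list layout (channels, height, width) are illustrative assumptions.

```python
from typing import List


def masked_gap(features: List[List[List[float]]], mask: List[List[float]]) -> List[float]:
    """Mask every channel with the part mask, then average over all spatial positions."""
    h, w = len(mask), len(mask[0])
    return [
        sum(channel[y][x] * mask[y][x] for y in range(h) for x in range(w)) / (h * w)
        for channel in features
    ]


# Two channels of size 2 x 2 and a part mask that selects the left column.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[0.0, 4.0], [8.0, 0.0]]]
part_mask = [[1, 0],
             [1, 0]]
print(masked_gap(features, part_mask))  # [1.0, 2.0]
```

Each part yields one such descriptor with one entry per channel, and the descriptors of all parts are the input to spectral clustering.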
Joint feature representation: In our work, we leverage the feature ensemble strategy. The final feature representation can be represented as:
where the terms denote the feature descriptors of the original image, the object image and the parts respectively. Each feature descriptor is extracted from the last convolutional layer of the corresponding classification network. We first perform GAP and $\ell_2$-normalization on each feature descriptor, and then concatenate them to train a classifier for the final classification.
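The feature ensemble can be sketched as normalizing each pooled descriptor and concatenating the results. The sketch assumes that the normalization is the L2 norm and that the descriptors have already been pooled by GAP; the function names `l2_normalize` and `joint_representation` are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a descriptor to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def joint_representation(image_desc: Sequence[float],
                         object_desc: Sequence[float],
                         part_descs: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate the normalized image-, object- and part-level descriptors."""
    joint: List[float] = []
    for desc in [image_desc, object_desc, *part_descs]:
        joint.extend(l2_normalize(desc))
    return joint


rep = joint_representation([3.0, 4.0], [0.0, 2.0], [[1.0, 0.0]])
print(rep)  # [0.6, 0.8, 0.0, 1.0, 1.0, 0.0]
```

Normalizing each stream before concatenation keeps one stream from dominating the joint vector merely because of its scale.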
4.2 Implementation Details
In our unsupervised part localization module, the input image is resized and then fed into a publicly available VGG-16 model pre-trained on ImageNet to extract feature maps from the relu5 and pool5 layers. The minimum support threshold is set to 0.07, 0.06 and 0.05 on the CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets respectively. The number of parts is set to 4. The scale factor in Eqn. (3) is set empirically, which makes the parts more representative.
In the part-based classification experiments, we use VGG-19  and ResNet-50  as the baseline models. We first train an image-level classification model based on full-size images of . Then, we adopt our proposed UPM approach to generate object-level and part-level training samples. Afterwards, we use these samples to fine-tune the image-level model to obtain an object-level model and four part-level models respectively. The input size of the object-level and part-level models are and respectively. The output of each CNN is extracted by GAP from the last convolutional layer to generate the feature descriptor in Section 3.3. All feature descriptors are concatenated into a representation to train a linear SVM classifier  for classification. We run experiments with MatConvNet 
| Method | Anno. in part localization | Acc. (%) |
4.3 Experiment on CUB-200-2011
In this section, we compare our proposed UPM with the baseline methods and the state-of-the-arts on CUB-200-2011. The comparison results are summarized in Table 1.
Benefiting from the parts localized by our UPM approach, as shown in Figure 4 (a), UPM (VGG-19) and UPM (ResNet-50) surpass the baseline models ResNet-50 and VGG-19 with 3.0% and 2.5% relative improvement respectively, due to the effectiveness of part mining. Our approach outperforms most of the methods with strong supervision, including bounding boxes, part annotations and image-level labels, listed in Table 1. Compared with the strongly-supervised methods [39, 3, 36], our approach achieves comparable results without any annotations.
Compared with the weakly-supervised methods that use only image-level labels, our approach is simple and does not need any annotations in part localization, yet it still achieves comparable results. We outperform PDFR, DVAN and FCAN by 0.9%, 6.4% and 1.1% respectively. Our accuracy is only 1.1% lower than that of the recent MA-CNN, which jointly learns part proposals and feature representation. However, our UPM approach localizes the parts in a fully unsupervised way, even without image-level annotations; thus, unlike MA-CNN, it does not need a sophisticated training process.
UPM (ResNet-50) achieves the state-of-the-art results among methods under the same setting that are fully-unsupervised without any annotations. Compared with NAC , UPM (ResNet-50) achieves accuracy with 4.4% relative improvement, which demonstrates that incorporating pattern mining techniques can efficiently mine the discriminative parts in an unsupervised manner.
4.4 Experiment on Stanford Cars
We further evaluate the performance of our proposed method on the Stanford Cars dataset. The results of part localization are shown in Figure 4 (b). The classification results are summarized in Table 2. UPM (ResNet-50) obtains 1.0% higher accuracy than FCAN (with Object Anno.). Besides, our approach achieves competitive results compared with [35, 15], which use bounding box annotations. This benefits from the representativeness of the support map and the effectiveness of pattern mining techniques. Furthermore, our approach outperforms most of the weakly-supervised methods that use image-level labels, such as DVAN, FCAN (w/o Object Anno.) and OPAM. Compared with FCAN (w/o Object Anno.), the relative 3.2% accuracy gain of UPM (ResNet-50) shows the significance of the parts mined in an unsupervised way. Moreover, our approach surpasses B-CNN, which uses high dimensional features and requires image-level labels, with nearly 1.0% relative accuracy gain.
| Method | Anno. in part localization | Acc. (%) |
4.5 Experiment on FGVC-Aircraft
Owing to the simple background of aircraft images, we obtain good object localization results, as shown in Figure 3. Therefore, the four localized parts are highly discriminative, as shown in Figure 4 (c). The classification results on the FGVC-Aircraft dataset are summarized in Table 3. Our approach achieves superior performance over the state-of-the-art methods and outperforms our baseline models by 2.7% and 3.3%, respectively. Compared with MG-CNN, which relies on object annotations, the clear 3.4% margin of UPM (ResNet-50) shows the effectiveness of our UPM. We even surpass B-CNN (w/o Object Anno.), which utilizes high dimensional features, with nearly 5.9% relative accuracy gain. It is worth noting that, compared with MA-CNN, which relies on multiple alternating training stages, our approach localizes the parts in an unsupervised manner and still achieves better accuracy.
| Method | Anno. in part localization | Acc. (%) |
4.6 Further Analysis
We further show a quantitative comparison in Table 4 to verify the contribution of the streams used in our UPM approach. We can observe that our UPM (ResNet-50) approach outperforms the “Original-stream+Object-stream” variant with 1.0% relative gain. This gain comes from the complementarity of the parts with the original and object images, which shows the effectiveness of the parts localized by our UPM approach.
| Our UPM (ResNet-50) approach | 85.4 |
In this paper, we propose a fully unsupervised part mining approach for fine-grained image classification, which explores the discriminative parts by incorporating pattern mining techniques. We employ pattern mining techniques to discover frequent patterns in the feature maps extracted from a pre-trained CNN model and apply a clustering algorithm to the mined patterns to generate the parts. The proposed approach requires no annotations in part localization, not even image-level labels, and does not require sophisticated training procedures. Extensive experiments show the effectiveness of UPM compared with other state-of-the-art methods on three challenging fine-grained datasets.
-  A. Agarwal and B. Triggs. Multilevel image coding with hyperfeatures. International Journal of Computer Vision, 78(1):15–27, 2008.
-  R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules. In Proc. 20th int. conf. very large data bases, VLDB, volume 1215, pages 487–499, 1994.
-  S. Branson, G. Van Horn, S. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952, 2014.
-  R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. Journal of machine learning research, 9(Aug):1871–1874, 2008.
-  B. Fernando, E. Fromont, and T. Tuytelaars. Mining mid-level features for image classification. International Journal of Computer Vision, 108(3):186–203, 2014.
-  B. Fernando and T. Tuytelaars. Mining multiple queries for image retrieval: On-the-fly learning of an object-specific mid-level representation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2544–2551, 2013.
-  J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, volume 2, page 3, 2017.
-  P.-H. Gosselin, N. Murray, H. Jégou, and F. Perronnin. Revisiting the fisher vector for fine-grained classification. Pattern Recognition Letters, 49:92–98, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  X. He, Y. Peng, and J. Zhao. Fine-grained discriminative localization via saliency-guided faster r-cnn. In Proceedings of the 2017 ACM on Multimedia Conference, pages 627–635. ACM, 2017.
-  S. Huang, Z. Xu, D. Tao, and Y. Zhang. Part-stacked cnn for fine-grained visual categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1173–1182, 2016.
-  M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
-  A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li. Novel dataset for fine-grained image categorization: Stanford dogs. In Proc. CVPR Workshop on Fine-Grained Visual Categorization (FGVC), volume 2, page 1, 2011.
-  J. Krause, H. Jin, J. Yang, and L. Fei-Fei. Fine-grained recognition without part annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5546–5555, 2015.
-  J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Y. Li, L. Liu, C. Shen, and A. Van Den Hengel. Mining mid-level visual patterns with deep cnn activations. International Journal of Computer Vision, 121(3):344–364, 2017.
-  T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear cnn models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 1449–1457, 2015.
-  X. Liu, T. Xia, J. Wang, Y. Yang, F. Zhou, and Y. Lin. Fully convolutional attention networks for fine-grained recognition. arXiv preprint arXiv:1603.06765, 2016.
-  S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
-  Y. Peng, X. He, and J. Zhao. Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing, 27(3):1487–1500, 2018.
-  T. Quack, V. Ferrari, B. Leibe, and L. Van Gool. Efficient mining of frequent and distinctive feature configurations. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
-  K. Rematas, B. Fernando, F. Dellaert, and T. Tuytelaars. Dataset fingerprints: Exploring image collections through data mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4867–4875, 2015.
-  P. Sermanet, A. Frome, and E. Real. Attention for fine-grained categorization. arXiv preprint arXiv:1412.7054, 2014.
-  M. Simon and E. Rodner. Neural activation constellations: Unsupervised part model discovery with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1143–1151, 2015.
-  M. Simon, E. Rodner, and J. Denzler. Part detector discovery in deep convolutional neural networks. In Asian Conference on Computer Vision, pages 162–177. Springer, 2014.
-  K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  M. Sun, Y. Yuan, F. Zhou, and E. Ding. Multi-attention multi-class constraint for fine-grained image recognition. arXiv preprint arXiv:1806.05372, 2018.
-  J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
-  A. Vedaldi and K. Lenc. Matconvnet: Convolutional neural networks for matlab. In Proceedings of the 23rd ACM international conference on Multimedia, pages 689–692. ACM, 2015.
-  C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
-  D. Wang, Z. Shen, J. Shao, W. Zhang, X. Xue, and Z. Zhang. Multiple granularity descriptors for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision, pages 2399–2406, 2015.
-  Y. Wang, J. Choi, V. Morariu, and L. S. Davis. Mining discriminative triplets of patches for fine-grained classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1163–1172, 2016.
-  X.-S. Wei, C.-W. Xie, J. Wu, and C. Shen. Mask-cnn: Localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76:704–714, 2018.
-  T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 842–850, 2015.
-  M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
-  H. Zhang, T. Xu, M. Elhoseiny, X. Huang, S. Zhang, A. Elgammal, and D. Metaxas. Spda-cnn: Unifying semantic part detection and abstraction for fine-grained recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1143–1152, 2016.
-  L. Zhang, Y. Yang, M. Wang, R. Hong, L. Nie, and X. Li. Detecting densely distributed graph patterns for fine-grained image categorization. IEEE Transactions on Image Processing, 25(2):553–565, 2016.
-  N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based r-cnns for fine-grained category detection. In European conference on computer vision, pages 834–849. Springer, 2014.
-  X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian. Picking deep filter responses for fine-grained image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1134–1142, 2016.
-  Y. Zhang, X.-S. Wei, J. Wu, J. Cai, J. Lu, V.-A. Nguyen, and M. N. Do. Weakly supervised fine-grained categorization with part-based image representation. IEEE Transactions on Image Processing, 25(4):1713–1725, 2016.
-  B. Zhao, X. Wu, J. Feng, Q. Peng, and S. Yan. Diversified visual attention networks for fine-grained object classification. IEEE Transactions on Multimedia, 19(6):1245–1256, 2017.
-  H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In Int. Conf. on Computer Vision, volume 6, 2017.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.