Fine-grained recognition, also known as subcategory classification, has been actively studied in the past several years. In contrast to the traditional image category recognition, fine-grained recognition focuses on identifying sub-ordinate categories such as different species of birds. This rapidly growing subfield in image-based object recognition not only improves the performance of conventional methods, but also helps humans in specific domains, since some fine-grained categories can only be recognized by domain experts.
Traditional methods based on statistics of features computed over the whole image are of limited use for fine-grained recognition, because the differences across sub-ordinate categories are mostly subtle. More effective solutions first localize the objects and their critical parts, and then utilize features computed in the local regions for recognition [2, 3, 4, 5]. In particular, the parts are often seen as discriminative regions that are very important for capturing the subtle category differences. By focusing on the local regions, the effect of background clutter can also be largely alleviated, leading to outstanding recognition performance.
However, the large appearance variations that widely exist in the real world make object and part localization extremely challenging. Popular proposal-based localization approaches are not ideal, as filtering thousands of proposed candidate regions per image is expensive. In addition, it is difficult to train a robust "filtering" model when the number of classes and parts becomes large.
In this paper, we propose a novel approach that iteratively localizes objects and parts for fine-grained recognition. It follows the data-driven idea and is therefore model-free. The key idea is to "transfer" location annotations from a few visually similar images retrieved from a large training dataset, where each image has bounding box annotations of both objects and important parts. One assumption in data-driven approaches is that there exists a large amount of annotated data and that, for most unseen test images, similar ones (in terms of both object and scene layout) can be found in the annotated training set, so that the annotations can be reliably transferred to the unseen images. It is worth noting that this is not a very strong assumption in the big data era, and similar pipelines have been successfully adopted in several related problems like image annotation and human motion analysis.
In the data-driven localization process, we adopt an iteration-based strategy to gradually focus on the target objects and their parts. As shown in Figure 1, our approach first locates a large bounding box of the bird object and then gradually adjusts the output towards a more precise localization boundary. The same method is also adopted to locate the parts. This iteration strategy is empirically found to be more effective than the existing one-step localization method.
The motivation behind our approach is very simple: when humans are given a visual scene, we normally obtain a gist of the scene first, and then gradually focus on specific objects. Object parts probably only need to be browsed or checked carefully when we want to understand the detailed properties of the object, e.g., the known clues to identify a particular species of bird. This biological visual perception procedure is simulated in the proposed approach for machine recognition of fine-grained categories.
2 Related Work
Fine-grained recognition has been extensively investigated recently. Most works used bird species categorization as the test case [10, 3, 2, 4, 11], and some used leaves, flowers, and dog breeds. Technically, one way to tackle the problem is to directly apply visual classification methods commonly used for standard object categorization. However, these approaches are incapable of capturing the subtle differences across the fine-grained categories. Thus, part-based approaches, which focus on extracting features in discriminative object parts, have become popular [2, 9]. One limitation of these approaches is that they adopt a similar pipeline for object/part detection, which relies on complex models that are difficult to train.
For instance, a two-level attention model was proposed recently. In addition, several researchers have explored human-interaction-based techniques [16, 17], which require more manual input.
The main contribution of our work is the iterative data-driven approach for both object and part localization. A few existing works have also adopted the data-driven idea for localization, but they used a one-step transfer process (without iteration), and many of them assumed that the object bounding boxes are given in the test images. The key difference is that we utilize the iteration-based strategy to gradually transfer object and part locations without requiring bounding box annotations at test time. A few researchers have investigated the idea of iterative learning in other problems like human pose estimation.
3 The Proposed Approach
We employ an iterative approach to process an image progressively from global to local regions. Our approach first locates the spatial areas of the objects and the object parts in the images. After that, we apply recognition models on the localized objects (and their parts) for category recognition. In the localization step, we adopt a data-driven scheme that reaches the goal by transferring information from similar images, where detailed category information is not needed. Specifically, two levels of iterations are required in the localization step, which are elaborated in the following.
3.1.1 Object-level Transfer
We first use an iterative transfer scheme to locate the object of interest in an input image. Figure 2
shows a single round of the transfer pipeline. The first step is to extract image features at multiple scales. For this, we adopt the popular convolutional neural networks (CNNs), using the publicly available VGGNet model. We follow recent work to extract the CNN features, where a spatial pyramid pooling (SPP) layer is added on top of the last convolutional layer, which pools features and generates fixed-length outputs.
Based on the features computed from the input image, we retrieve a small set of nearest neighbors in the training dataset, where the images are labeled with both object and part locations. Next, the location annotations from the similar training images are transferred to the input image. Since the images and the objects are of different sizes, we propose a simple bounding box fusion method so that the location annotations from multiple training images can be combined to produce the bounding box for the input image.
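The retrieval step above can be sketched as a nearest-neighbor search over the pooled CNN features. This is a minimal illustration under our own assumptions (the function name, the use of L2-normalized features, and Euclidean distance are ours, not a description of the paper's exact implementation):

```python
import numpy as np

def retrieve_neighbors(query_feat, train_feats, k=5):
    """Return indices of the k training images whose L2-normalized
    CNN features are closest (in Euclidean distance) to the query."""
    q = query_feat / np.linalg.norm(query_feat)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    dists = np.linalg.norm(t - q, axis=1)
    return np.argsort(dists)[:k]
```

On normalized features, ranking by Euclidean distance is equivalent to ranking by cosine similarity, which is a common choice for CNN feature retrieval.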
Specifically, the bounding box fusion process is executed by mapping all the images into a common space and then merging the boxes. Given an input image $I$ with size $s$, we have a candidate set of $M$ retrieved images, denoted by $\{(I_m, B_m, s_m)\}_{m=1}^{M}$, where $B_m$ and $s_m$ are the bounding box annotations and the sizes of the candidate images, respectively. All the images are resized to a uniform size $s_0$, with the bounding box locations updated according to the new size, denoted by $\{\hat{B}_m\}_{m=1}^{M}$. We then take the union of the bounding boxes as the fused bounding box of the input image. Finally, this fused box is mapped back according to the original size $s$ of the input image as the output of this iteration. Notice that union is used as we found it more effective than average or intersection fusion, because it maximizes the likelihood of containing the entire object.
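The fusion step can be sketched as follows. This is a minimal sketch under our assumptions (the function name, the `(x1, y1, x2, y2)` box convention, and the choice of a 256x256 common space are illustrative; "union" is taken as the smallest box containing all transferred boxes):

```python
import numpy as np

def fuse_boxes(boxes, sizes, input_size, canvas=(256, 256)):
    """Union-fuse bounding boxes transferred from candidate images.

    boxes:      list of (x1, y1, x2, y2) in each candidate's coordinates
    sizes:      list of (w, h) of the candidate images
    input_size: (w, h) of the input image
    """
    mapped = []
    for (x1, y1, x2, y2), (w, h) in zip(boxes, sizes):
        sx, sy = canvas[0] / w, canvas[1] / h     # map into the common space
        mapped.append((x1 * sx, y1 * sy, x2 * sx, y2 * sy))
    m = np.array(mapped)
    # union: the smallest box containing all mapped boxes
    ux1, uy1 = m[:, 0].min(), m[:, 1].min()
    ux2, uy2 = m[:, 2].max(), m[:, 3].max()
    # map back to the input image's original coordinates
    bx, by = input_size[0] / canvas[0], input_size[1] / canvas[1]
    return (ux1 * bx, uy1 * by, ux2 * bx, uy2 * by)
```

Swapping `min`/`max` for means would give average fusion, and inverting them would give intersection fusion, the two alternatives mentioned above.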
After receiving the bounding box from the first iteration, we update the input image by cropping out only the object areas, with which we proceed to perform the next iteration to generate a more precise bounding box. Before performing the next iteration, we also crop all the training images so that they can be matched more accurately with the input image. This is done by treating each training image as an input image, and using the rest to transfer the bounding boxes. In order to ensure that all the cropped training images contain the entire objects, we adjust the cropped area using the bounding box annotations, as visualized in Figure 3.
There are multiple ways to terminate this iterative process. One is to stop when the bounding box does not change significantly across iterations. Since the bounding boxes are eventually used for recognition in our problem, we adopt a different strategy: we stop when the prediction score from a raw classifier trained on entire images (not the detected bounding boxes) is higher than a pre-defined threshold. This is easy to implement and was found to be slightly better.
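The iterate-and-crop loop with this stopping criterion can be sketched as below. The helpers `transfer_once`, `crop`, and `classifier_score` are hypothetical stand-ins for the transfer round, the cropping step, and the raw classifier described above:

```python
def iterative_localize(image, classifier_score, transfer_once, crop,
                       threshold=0.9, max_iters=5):
    """Repeat the data-driven transfer step, cropping to the current box
    each round, until the raw classifier is confident enough."""
    region = image
    box = None
    for _ in range(max_iters):
        box = transfer_once(region)      # one round of box transfer
        region = crop(region, box)       # focus on the localized area
        if classifier_score(region) > threshold:
            break                        # confident enough: stop iterating
    return box
```

A cap on the number of iterations (`max_iters`) is a natural safeguard in case the score never crosses the threshold.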
3.1.2 Part-level Transfer
The object-level bounding boxes are not sufficient for fine-grained recognition, as the differences across some categories may lie only in very small object parts. Harnessing features computed on such parts is very helpful, as validated by several previous studies [3, 21, 2, 5, 22]. In this work, we execute an iterative process similar to the object-level transfer to locate critical object parts.
Our part-level transfer pipeline is shown in Figure 4. In this pipeline, we take the localized object as input and compare against the objects in the training set. Similar images are found based on matching the same CNN features. The part-level bounding boxes are fused in the same way as we fuse the object-level bounding boxes. This process can be iteratively executed to achieve a good localization of parts.
We underline that our localization approach is quite different from proposal-based methods, which extract thousands of candidate boxes in one image and filter all of them to pick the most probable object bounding box(es). Our method relies on a purely data-driven scheme, which is much easier to implement and, as will be shown later, performs even better.
3.1.3 Bounding Box Refinement with Regression
The bounding boxes obtained by the proposed two-level iterative process are good, but there is still room for improvement. A popular measure to evaluate the quality of object/part localization is Intersection-over-Union (IoU), which computes the ratio of the overlapped region between the detected box and the ground-truth box to the union of the two boxes. Figure 5 gives a few examples, where we see that the measure can be quite low for small boxes like the heads of the birds.
We use a simple bounding box regression method to mitigate the deviation. Based on the object and part bounding boxes obtained by the iterative process, we predict refined bounding boxes using a class-specific bounding box regressor. This is similar to the method used in previous works like R-CNN  and deformable part models .
More formally, our goal is to learn a transformation that maps a predicted box to the corresponding ground-truth box. Suppose there are $N$ training pairs $\{(P^i, G^i)\}_{i=1}^{N}$, where $P^i = (P^i_x, P^i_y, P^i_w, P^i_h)$ denotes the coordinates (upper-left corner) of the predicted box together with its width and height, and $G^i = (G^i_x, G^i_y, G^i_w, G^i_h)$ denotes the ground-truth box.
Following R-CNN, the transformation is parameterized using four functions $d_x(P)$, $d_y(P)$, $d_w(P)$ and $d_h(P)$, where the first two refer to the scale-invariant translation of the box coordinates (upper-left corner) and the last two indicate the log-scale translations of the width and height of the box. Once these functions are learned, the refined box (predicted ground-truth) can be obtained by: $\hat{G}_x = P_w d_x(P) + P_x$, $\hat{G}_y = P_h d_y(P) + P_y$, $\hat{G}_w = P_w \exp(d_w(P))$, $\hat{G}_h = P_h \exp(d_h(P))$.
Each function is modeled in linear form with the CNN features as input: $d_\star(P) = \mathbf{w}_\star^{\top} \phi(P)$, where $\star$ indicates one of $x, y, w, h$ and $\phi(P)$ is the CNN feature.
$\mathbf{w}_\star$ is the vector of parameters, which are learned by optimizing the following objective function:
$$\mathbf{w}_\star = \arg\min_{\hat{\mathbf{w}}_\star} \sum_{i=1}^{N} \left( t^i_\star - \hat{\mathbf{w}}_\star^{\top} \phi(P^i) \right)^2,$$
where the regression targets are defined as $t_x = (G_x - P_x)/P_w$ and $t_y = (G_y - P_y)/P_h$ if $\star$ is $x$ or $y$, and $t_w = \log(G_w/P_w)$ and $t_h = \log(G_h/P_h)$ if $\star$ is $w$ or $h$.
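The regression targets and the refinement step can be sketched as below. This is a minimal illustration of R-CNN-style box regression (the helper names and the `(x, y, w, h)` convention are ours); applying the learned offsets to a box inverts the target computation:

```python
import numpy as np

def regression_targets(P, G):
    """Targets (t_x, t_y, t_w, t_h) for a predicted box P and a
    ground-truth box G, both given as (x, y, w, h)."""
    tx = (G[0] - P[0]) / P[2]      # scale-invariant x translation
    ty = (G[1] - P[1]) / P[3]      # scale-invariant y translation
    tw = np.log(G[2] / P[2])       # log-scale width change
    th = np.log(G[3] / P[3])       # log-scale height change
    return tx, ty, tw, th

def apply_regression(P, d):
    """Refine box P = (x, y, w, h) with predicted offsets d = (dx, dy, dw, dh)."""
    gx = P[2] * d[0] + P[0]
    gy = P[3] * d[1] + P[1]
    gw = P[2] * np.exp(d[2])
    gh = P[3] * np.exp(d[3])
    return gx, gy, gw, gh
```

By construction, `apply_regression(P, regression_targets(P, G))` recovers `G` exactly, which is a convenient sanity check when implementing the regressor.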
After the iterative transfer and the bounding box refinement, we arrive at a set of object and part bounding boxes for each input image containing an object of interest. (In practice, an image without a target object may be excluded at the localization stage if its matching similarity scores with the training images are small.) To recognize the specific class of the object, we again adopt the CNN features computed in each object/part bounding box. The VGGNet model is adopted, with parameters fine-tuned using the image patches in the bounding boxes. Features extracted by the fine-tuned CNN model from the different boxes are concatenated to train one-vs-all linear SVM classifiers for final prediction. Notice that this simple feature-concatenation-based recognition method has been adopted by several previous works [9, 2]. Advanced fusion methods that automatically learn the weight of each feature may lead to better performance.
4 Experiments
4.1 Dataset and Evaluation
The CUB200-2011 (a.k.a. Caltech-UCSD Birds) dataset contains 11,788 images of 200 bird species. Each image in CUB200-2011 is annotated with bounding boxes of both the object (bird) and its parts. We adopt two part boxes in the experiments, head and body, following the protocol of prior work.
Birdsnap is a much larger dataset with 49,829 images spanning 500 species of North American birds. Each image has detailed location annotations and additional attribute labels such as male, female, immature, etc. In this work, we only adopt the location annotations.
For both datasets, localization accuracy is measured by the percentage of correctly localized parts (PCP). A detected part is considered as a correct hit only when its Intersection-over-Union value with the ground-truth is larger than a threshold. Object-level localization results are not discussed as localizing parts is a more difficult task, and once parts are correctly localized, object localization is most likely to be correct.
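The PCP measure can be sketched as follows (a minimal, self-contained illustration; the function names and box convention are ours):

```python
def _iou(a, b):
    """IoU helper for (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def pcp(detections, ground_truths, threshold=0.5):
    """Percentage of Correctly localized Parts: a detection is a hit
    when its IoU with the ground-truth box exceeds the threshold."""
    hits = sum(_iou(d, g) > threshold
               for d, g in zip(detections, ground_truths))
    return 100.0 * hits / len(detections)
```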
For the final recognition results, we also use accuracy as the performance measure, which is the percentage of samples with correctly recognized bird species.
4.2 Results on CUB200-2011
We first report and discuss results evaluating in isolation the ability of our approach to accurately localize parts, summarized in Table 1 and Table 2. After that, we present recognition results using different kinds of inputs in Table 3 and compare with the state of the art in Table 4.
4.2.1 Part Localization
In Table 1, we summarize the results using different values of M, i.e., the number of nearest training images used for bounding box transfer. For all the evaluated overlap thresholds, a moderate M appears to be a good option: using only the single most similar training image and copying its bounding boxes is not precise enough, while using too many training images may introduce noise from the less similar ones. The head localization results are lower than those of the body, as heads are smaller and a small shift away from the ground truth can significantly affect the overlap ratio (see Figure 5).
We also evaluate part localization under the assumption that the object-level bounding box is given. Results are reported in Table 2, together with the results of two representative approaches: DPM and Part-based R-CNNs. We see that the results of both compared approaches are significantly better when the object-level bounding boxes are given. In contrast, our approach holds the very appealing advantage of not requiring the oracle object-level boxes as inputs: its performance without the oracle object bounding boxes is similar under most settings. Figure 6 shows several examples of our localization results.
Compared with the two alternative approaches, we obtain significantly better results for the body part and lower accuracy for the head part (using the same overlap threshold of 0.5). The reason for our lower performance on head detection is that we take the "union" of the training bounding boxes in the iterative transfer process, which normally produces larger boxes. This is fine for large parts like the body, but for small parts, as discussed earlier, the Intersection-over-Union values of the predicted boxes are affected much more significantly. Note that the slightly larger bounding boxes from our approach turn out to be better in the recognition stage (see the comparison of recognition results with the same approach in Table 4), which may be because the ground-truth annotations are not very accurate and tend to be smaller than the real object parts in many cases.
| Methods | Oracle Box Given (Head) | Oracle Box Given (Body) | Oracle Box Unknown (Head) | Oracle Box Unknown (Body) |
|---|---|---|---|---|
| Strong DPM | 43.5 | 75.2 | 37.4 | 47.1 |
| Part-based R-CNNs | 68.5 | 79.8 | 61.9 | 70.7 |
| Input Image Region | Accuracy (%) |
|---|---|
| Entire Image | 62.5 |
| Object-level Box (Oracle) | 79.1 |
| Object-level Box (Ours) | 76.9 |
| Head Box (Ours) | 67.4 |
| Body Box (Ours) | 74.0 |
| Method | Test-time Annotation | Feature | Accuracy (%) |
|---|---|---|---|
| Berg et al. (CVPR13) | object + part boxes | POOF | 73.3 |
| Zhang et al. (ECCV14) | object + part boxes | AlexNet | 82.0 |
| Branson et al. (BMVC14) | object + part boxes | AlexNet | 85.4 |
| Goring et al. (CVPR14) | object box | HOG | 57.8 |
| Gavves et al. (ICCV13) | object box | Fisher | 62.7 |
| Huang et al. (CVPR16) | object box | AlexNet | 76.6 |
| Zhang et al. (ECCV14) | none | AlexNet | 73.9 |
| Branson et al. (BMVC14) | none | AlexNet | 75.7 |
| Zhang et al. (ICCV15) | none | VGGNet | 81.6 |
We first discuss results using features computed from different object/part bounding boxes, in order to understand the contribution of each image region to fine-grained recognition. As shown in Table 3, the accuracy of using features computed on the entire images (without localization) is worse than that of relying on features in the small head boxes (62.5% vs. 67.4%). This indicates that using entire images is not reliable due to background clutter. Using features computed in our predicted object or body bounding boxes, we achieve much better results.
Table 4 gives the result of fusing the features computed in our predicted bounding boxes with those of the entire images (the bottom row), and compares it with a large set of recently proposed approaches. Fusing the features offers a big leap in recognition performance, which validates the importance of focusing on both the object and its critical parts for fine-grained recognition. The compared approaches are grouped into three categories: the first three additionally adopted the ground-truth object and part bounding boxes in the test set; the next three used the ground-truth object bounding boxes in the test set; and the following three used neither object nor part annotations in the test set. Our approach, which performs automatic localization of objects and parts at test time, offers very competitive results.
| Method | Accuracy (%) |
|---|---|
| One vs. Most + ST Prior | 66.6 |
| Entire Image Classification | 60.7 |
| Object-level Box (Ours) | 73.4 |
4.3 Results on Birdsnap
Finally, we present results on the Birdsnap dataset in Table 5. We see that the recognition accuracy of using the entire images is just 60.7%, much lower than the result reported by the authors of the dataset. By adopting our iterative object localization and using features from the predicted boxes, the performance is significantly improved to 73.4%, which again verifies the effectiveness of our approach. Notice that this large dataset does not contain part-level bounding box annotations, so the part-level transfer is not evaluated.
5 Conclusion
We have proposed a novel approach for object and part localization in fine-grained recognition tasks. Our approach follows a data-driven pipeline by iteratively transferring bounding boxes from similar training images. We show that such a simple approach can produce better localization results than the popular proposal-based methods that have to filter thousands of candidate bounding box proposals in each image. Using deep learning features computed in our predicted object/part bounding boxes, very competitive accuracies are obtained. The results indicate that it is very important to incorporate clues from objects and parts so that the subtle differences across the categories can be captured.
Acknowledgements This work was supported in part by two NSFC projects (#61572134 and #U1509206) and one grant from STCSM, China (#16QA1400500).
-  Navneet Dalal and Bill Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
-  Ning Zhang, Jeff Donahue, and et al., “Part-based r-cnns for fine-grained category detection,” in ECCV, 2014.
-  Thomas Berg and Peter N Belhumeur, “POOF: Part-Based One-vs-One Features for fine-grained categorization, face verification, and attribute estimation,” in CVPR, 2013.
-  Steve Branson, Grant Van Horn, and et al., “Bird species categorization using pose normalized deep convolutional nets,” in BMVC, 2014.
-  Ning Zhang, Ryan Farrell, and et al., “Deformable part descriptors for fine-grained recognition and attribute prediction,” in ICCV, 2013.
-  Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan, “Object detection with discriminatively trained part-based models,” PAMI, vol. 32, no. 9, pp. 1627–1645, 2010.
-  Antonio Torralba, Rob Fergus, and William T Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” PAMI, vol. 30, no. 11, pp. 1958–1970, 2008.
-  Liu Ren, Alton Patrick, Alexei A Efros, Jessica K Hodgins, and James M Rehg, “A data-driven approach to quantifying natural human motion,” ACM TOG, vol. 24, no. 3, pp. 1090–1097, 2005.
-  Christoph Goring, Erid Rodner, and et al., “Nonparametric part transfer for fine-grained recognition,” in CVPR, 2014.
-  Shaoli Huang, Zhe Xu, Dacheng Tao, and Ya Zhang, “Part-stacked cnn for fine-grained visual categorization,” in CVPR, 2016.
-  Jian Pu, Yu-Gang Jiang, Jun Wang, and Xiangyang Xue, “Which looks like which: Exploring inter-class relationships in fine-grained visual categorization,” in ECCV, 2014.
-  Neeraj Kumar, Peter N Belhumeur, and et al., “Leafsnap: A computer vision system for automatic plant species identification,” in ECCV, 2012.
-  Yin Cui, Feng Zhou, Yuanqing Lin, and Serge Belongie, “Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop,” in CVPR, 2016.
-  Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li, “Novel dataset for fine-grained image categorization: Stanford dogs,” in CVPR Workshop on FGVC, 2011.
-  Tianjun Xiao, Yichong Xu, and et al., “The application of two-level attention models in deep convolutional neural network for fine-grained image classification,” in CVPR, 2015.
-  Catherine Wah, Steve Branson, Pietro Perona, and Serge Belongie, “Multiclass recognition and part localization with humans in the loop,” in ICCV, 2011.
-  Steve Branson, Pietro Perona, and et al., “Strong supervision from weak annotation: Interactive training of deformable part models,” in ICCV, 2011.
-  Joao Carreira, Pulkit Agrawal, and et al., “Human pose estimation with iterative error feedback,” arXiv preprint arXiv:1507.06550, 2015.
-  Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in ECCV, 2014.
-  Ryan Farrell, Om Oza, Ning Zhang, Vlad I Morariu, Trevor Darrell, and Larry S Davis, “Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance,” in ICCV, 2011.
-  Jiongxin Liu, Angjoo Kanazawa, David Jacobs, and Peter Belhumeur, “Dog breed classification using part localization,” in ECCV, 2012.
-  Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
-  Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue, “Exploring inter-feature and inter-class relationships with deep neural networks for video classification,” in ACM MM, 2014.
-  Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie, “The caltech-ucsd birds-200-2011 dataset,” California Institute of Technology, 2011.
-  Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L. Alexander, David W. Jacobs, and Peter N. Belhumeur, “Birdsnap: Large-scale fine-grained visual categorization of birds,” in CVPR, 2014.
-  Hossein Azizpour and Ivan Laptev, “Object detection using strongly-supervised deformable part models,” in ECCV, 2012.
-  Efstratios Gavves, Basura Fernando, and et al., “Fine-grained categorization by alignments,” in ICCV, 2013.
-  Dequan Wang, Zhiqiang Shen, Jie Shao, Wei Zhang, Xiangyang Xue, and Zheng Zhang, “Multiple granularity descriptors for fine-grained categorization,” in ICCV, 2015.