Fine-grained visual categorization (FGVC) refers to the task of identifying objects from subordinate categories and is now an important subfield in object recognition. FGVC applications include, for example, recognizing species of birds [1, 2, 3], pets [4, 5], flowers [6, 7], and cars [8, 9]. Lay individuals tend to find it easy to quickly distinguish basic-level categories (e.g., cars or dogs), but identifying subordinate classes like “Ringed-billed gull” or “California gull” can be difficult, even for bird experts. Tools that aid in this regard would be of high practical value.
This task is challenging because of the small inter-class variance caused by subtle differences between subordinate categories and the large intra-class variance caused by nuisance factors such as differing pose, multiple views, and occlusions. However, impressive progress [10, 3, 11, 12, 13] has been made over the last few years, and fine-grained recognition techniques are now close to practical use in applications such as wildlife observation and surveillance systems.
Whilst numerous attempts have been made to boost the classification accuracy of FGVC [14, 15, 16, 17, 18], an important aspect of the problem has yet to be addressed, namely the ability to generate a human-understandable “manual” on how to distinguish fine-grained categories in detail. For example, ecological protection volunteers would benefit from an algorithm that could not only accurately classify bird species but also provide brief instructions on how to distinguish very similar subspecies (a “Ringed-billed” and “California gull”, for instance, differ only in their beak pattern, see Figure 1), aided by some intuitive illustrative examples. Existing fine-grained recognition methods that aim to provide a visual field guide mostly follow a “part-based one-vs.-one features” (POOFs) [19, 20, 3] routine or employ human-in-the-loop methods [21, 22, 23]. However, since the amount of available data requiring interpretation is increasing drastically, a method that simultaneously implements and interprets FGVC using deep learning methods is now both possible and advocated.
It is widely acknowledged that the subtle differences between fine-grained categories mostly reside in the unique properties of object parts [25, 19, 15, 26, 27, 28]. Therefore, a practical solution to interpreting classification results as human-understandable manuals is to discover classification criteria from object parts. Some existing fine-grained datasets provide detailed part annotations including part landmarks and attributes [2, 9]. However, they are usually associated with a large number of object parts, which incurs a heavy computational burden for both part detection and classification. From this perspective, a method that follows an object part-aware strategy to provide interpretable prediction criteria with minimal computational effort while handling large numbers of parts is desirable. In this scenario, independently training a large convolutional neural network (CNN) for each part and then combining them in a unified framework is impractical.
Here we address the fine-grained categorization problem not only in terms of accuracy and efficiency when performing subordinate-level object recognition but also with regard to the interpretable characteristics of the resulting model. We do this by learning a new part-based CNN for FGVC that models multiple object parts in a unified framework with high efficiency. Similar to previous fine-grained recognition approaches, the proposed method consists of a localization module to detect object parts (“where pathway”) and a classification module to classify fine-grained categories at the subordinate level (“what pathway”). In particular, our key point localization network is composed of a sub-network used in contemporary classification networks (AlexNet and BN-GoogleNet) and a 1×1 convolutional layer followed by a softmax layer to predict evidence of part locations. The inferred part locations are then fed into the classification network, in which a two-stream architecture is proposed to analyze images at both the object level (global information) and part level (local information). Multiple parts are computed via a shared feature extraction route, separated directly on feature maps using a part cropping layer, concatenated, and then fed into a shallower network for object classification. In addition to categorical predictions, our method also generates interpretable classification instructions based on object parts. Since the proposed framework builds on a deeper network architecture and employs a sharing strategy that stacks the computation of multiple parts, we call it Deeper Part-Stacked CNN (DPS-CNN).
This paper makes the following contributions:
DPS-CNN is the first efficient framework that not only achieves state-of-the-art performance on Caltech-UCSD Birds-200-2011 but also allows interpretation;
We explore a new paradigm for key point localization, which exceeds state-of-the-art performance on the Birds-200-2011 dataset;
Our classification network follows a two-stream structure that captures both object-level (global) and part-level (local) information, in which a new share-and-divide strategy is presented to compute multiple object parts. As a result, the proposed architecture is very efficient in terms of frames/sec throughput (measured against a single CaffeNet under the same experimental setting) without sacrificing fine-grained categorization accuracy. Also, we propose a new strategy called scale mean-max (SMM) for feature fusion learning.
This paper is not a direct extension of our previous work or of several other state-of-the-art fine-grained classification models [31, 32, 33, 17], but a significant development in the following respects. First, unlike prior work that adapts FCN for part localization, we propose a new paradigm for key point localization that first samples a small number of representative pixels and then determines their labels via a convolutional layer followed by a softmax layer. Second, we propose a new network architecture and enrich the previously used methodology. Third, we introduce a simple but effective part feature encoding method, scale mean-max (SMM), in contrast to the bilinear, spatially weighted Fisher vector, and part-based fully connected encodings used in prior work.
The remainder of the paper is organized as follows. Related works are summarized in Section 2, and the proposed architecture including the localization and classification networks is described in Section 3. Detailed performance studies and analysis are presented in Section 4, and in Section 5 we conclude and propose various applications of the proposed DPS-CNN architecture.
2 Related Work
Keypoint Localization. Subordinate categories generally share a fixed number of semantic components defined as “parts” or “key points” but with subtle differences in these components. Intuitively, when distinguishing between two subordinate categories, the widely accepted approach is to align components containing these fine differences. Therefore, localizing parts or key points plays a crucial role in fine-grained recognition, as demonstrated in recent works [19, 34, 26, 27, 35, 36]. One group of methods relies on an explicit model of the object shape. For example, the active shape model (ASM) uses a mixture of Gaussian distributions to model the shape. Although these techniques provide an effective way to locate facial landmarks, they usually cannot handle a wide range of variation such as that seen in bird species recognition. The other group of methods [16, 41, 42, 27, 43, 44, 32, 45], which has become more popular in recent years, trains a set of key point detectors to model local appearance and then uses a spatial model to capture their dependencies. Among them, the part localization methods proposed in [43, 44, 32] are most similar to ours. In one of these works, a convolutional sub-network is used to predict the bounding box coordinates without using region candidates. Although its performance is acceptable because the network is learned by jointly optimizing the part regression, classification, and alignment, all parts of the model need to be trained separately. To tackle this problem, other works adopt a pipeline similar to Fast R-CNN, in which part region candidates are generated to learn the part detector. In this work, we discard the common proposal-generating process and regard all receptive field centers of a certain intermediate layer as potential candidate key points (here, the receptive field means the area of the input image to which a location in a higher-layer feature map corresponds). This strategy results in a highly efficient localization network, since we take advantage of the natural properties of CNNs to avoid the process of proposal generation.
Our work is also inspired by and inherited from fully convolutional networks (FCNs), which produce dense predictions with convolutional networks. However, our network structure is best regarded as a fast and effective approach to predicting sparse pixels, since we only need to determine the class labels of the centers of the receptive fields of interest. Thus, FCN is more suited to segmentation, while our framework is designed for sparse key point detection. FCN predicts intermediate feature maps and then upsamples them to match the input image size for pixel-wise prediction, and recent works [31, 48] borrow this idea directly for key point localization. During training, both of these works resize the ground truths to the size of the output feature maps and then use them to supervise the network learning, while, during testing, the predicted feature maps are resized to match the input size to generate the final key point prediction. However, these methods cannot guarantee accurate position prediction due to the upsampling process.
Fine-Grained Visual Categorization. A number of methods have been developed to classify object categories at the subordinate level. The best performing methods have gained performance improvements by exploiting the following three aspects: more discriminative features (including deep CNNs) for better visual representation [49, 50, 24, 51, 52]; explicit alignment approaches to eliminate pose displacements [16, 53]; and part-based methods to examine the impact of object parts [19, 34, 26, 27, 35, 36]. Another line of work explores human-in-the-loop methods [54, 14, 55] to identify the most discriminative regions for classifying fine-grained categories. Although such methods provide direct and important information about how humans perform fine-grained recognition, they are not scalable due to the need for human interaction during testing. Of these, part-based methods are thought to be most relevant to fine-grained recognition, in which the subtle differences between fine-grained categories mostly relate to the unique object part properties.
Some part-based methods [19, 27] employ strong annotations including bounding boxes, part landmarks, or attributes from existing fine-grained recognition datasets [2, 5, 9, 11]. While strong supervision significantly boosts performance, the expensive human labelling process motivates the use of weakly-supervised fine-grained recognition without manually labeled part annotations, i.e., discovering object parts in an unsupervised fashion [56, 12, 17]. Current state-of-the-art methods for fine-grained recognition employ deep feature encoding, while DPS-CNN is largely inherited from part-based R-CNN, which first detects the location of two object parts and then trains an individual CNN based on the unique properties of each part. Compared to part-based R-CNN, the proposed method is far more efficient for both detection and classification. As a result, we can use many more object parts than part-based R-CNN, while still maintaining speed during testing.
Lin et al. argued that manually defined parts were sub-optimal for object recognition and thus proposed a bilinear model consisting of two streams whose roles were interchangeable as detectors or features. Although this design exploited the data-driven approach to possibly improve classification performance, it also made the resulting model difficult to interpret. In contrast, our method attempts to balance the need for classification accuracy and model interpretability in fine-grained recognition systems.
3 Deeper Part-Stacked CNN
A key motivation of our proposed method is to produce a fine-grained recognition system that not only considers recognition accuracy but also addresses efficiency and interpretability. To ensure that the resulting model is interpretable, we employ strong part-level annotations with the potential to provide human-understandable classification criteria. We also adopt several strategies, such as sparse prediction instead of dense prediction, to eliminate part proposal generation and to share computation for all part features. For the sake of classification accuracy, we learn a comprehensive representation by incorporating both global (object-level) and local (part-level) features. Based on these, in this section we present the model architecture of the proposed Deeper Part-Stacked CNN (DPS-CNN).
According to the common framework for fine-grained recognition, the proposed architecture is decomposed into a localization network (Section 3.1) and a classification network (Section 3.2). In our previous work, we adopted CaffeNet, a slightly modified version of the standard seven-layer AlexNet architecture, as the basic network structure. In this paper, we use a deeper and more powerful network (BN-GoogleNet) as a substitute. A unique feature of our architecture is that the message transferring operation from the localization network to the classification network, which uses the detected part locations to perform part-based classification, is conducted directly on the Inception-4a output feature maps within the data forwarding process. This is a significant departure from the standard two-stage pipeline of part-based R-CNN, which consecutively localizes object parts and then trains part-specific CNNs on the detected regions. Based on this design, sharing schemes are performed to make the proposed DPS-CNN fairly efficient for both learning and inference. Figure 2 illustrates the overall network architecture.
3.1 Localization Network
The first stage in our proposed architecture is a localization network that aims to detect the location of object parts. We employ the simplest form of part landmark annotation, where a 2D key point is annotated at the center of each object part. We assume that the number of object parts labeled in the dataset is sufficiently large to offer a complete set of object parts in which fine-grained categories usually differ. A naive approach to predicting these key points is to directly apply the FCN architecture for dense pixel-wise prediction. However, this method usually biases the learned predictor because, in this task and unlike semantic segmentation, the number of key point annotations is extremely small compared to the number of irrelevant pixels.
Motivated by the recent progress in object detection and semantic segmentation, we propose to use the centers of receptive fields as key point candidates and use a fully convolutional network to perform sparse pixel prediction to locate the key points of object parts (see Figure 4(b)). In the field of object detection, box candidates that are likely to contain objects are first extracted using proposal-generating methods such as selective search and region proposal networks. Then, CNN features are learned to represent these box candidates and finally used to determine their class label. We adapt this pipeline to key point localization but omit the candidate generation process and simply treat the centers of receptive fields corresponding to a certain layer as candidate points. As shown in Figure 4(a), the advantage of using this method is that each candidate point can be represented by a cross-channel feature vector in the output feature maps. Also, our candidate point evaluation experiments in Table I show that the receptive fields of the inception-4a layer in BN-GoogleNet yield a regular grid of candidate points with high recall under the PCK metric.
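The idea of treating receptive-field centers as candidate key points can be illustrated with a minimal sketch. It assumes a stride of 16 for the chosen layer (the value reported for inception-4a in Section 4.2) and a hypothetical 448 × 448 input. The half-stride offset of the first center is also an assumption, since the exact center positions depend on the padding used in the network.

```python
def candidate_points(height, width, stride=16, offset=None):
    """Return (y, x) input-image coordinates of receptive-field centers.

    Each output unit of a layer with the given stride is mapped to one
    candidate point; no proposal generation is involved.
    """
    if offset is None:
        offset = stride // 2  # assumed position of the first center
    return [(y, x)
            for y in range(offset, height, stride)
            for x in range(offset, width, stride)]


# Hypothetical input size; the candidates form a regular grid.
points = candidate_points(448, 448)
print(len(points), points[0], points[-1])
```

Under these assumptions every pixel of the image lies within roughly half a stride (diagonally, about 11 pixels) of some candidate, which is why coverage is good at loose tolerances and degrades at strict ones.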
Fully convolutional network. An FCN is obtained by replacing the parameter-rich fully connected layers in standard CNN architectures with convolutional layers with kernels of spatial size 1×1. Given an input RGB image, the output of an FCN is a feature map of reduced dimension compared to the input. The computation of each unit in the feature map only depends on pixels inside a region of fixed size in the input image, which is called its receptive field. We prefer FCNs for the following reasons: (1) feature maps generated by FCNs can be directly utilized as the part locating results in the classification network, as detailed in Section 3.2; (2) the results of multiple object parts can be obtained simultaneously; (3) FCNs are very efficient for both learning and inference.
Learning. We model the part localization process as a multi-class classification problem on sparse output spatial positions. Specifically, suppose the output of the last FCN convolutional layer is of size $h \times w \times d$, where $h$ and $w$ are spatial dimensions and $d$ is the number of channels. We set $d = M + 1$. Here, $M$ is the number of object parts and $1$ denotes an additional channel to model the background. To generate corresponding ground-truth labels in the form of feature maps, units indexed by spatial positions are labeled with their nearest object part; units that are not close to any of the labeled parts (as judged by their overlap with respect to a receptive field) are labeled as background. In this way, ground-truth part annotations are transformed into the form of corresponding feature maps, whereas in recent works that directly apply FCNs [31, 48], the supervision information is generated by directly resizing the part ground-truth image.
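The label generation step can be sketched as follows. This is an illustrative simplification, not the authors' implementation: closeness is measured here by a plain Euclidean distance threshold (`max_dist`, a hypothetical parameter) rather than by overlap with a receptive field, and the stride and half-stride center offset are assumptions.

```python
import math

BACKGROUND = 0  # label 0 is reserved for the background channel


def build_label_map(parts, grid_h, grid_w, stride=16, max_dist=16.0):
    """Turn key point annotations into a grid of per-unit class labels.

    parts: dict mapping part id (1..M) to its (y, x) pixel location,
           or to None when the part is not visible.
    Each output unit gets the id of its nearest annotated part if that
    part lies within max_dist pixels of the unit's center, else background.
    """
    labels = [[BACKGROUND] * grid_w for _ in range(grid_h)]
    for i in range(grid_h):
        for j in range(grid_w):
            cy = i * stride + stride / 2  # assumed receptive-field center
            cx = j * stride + stride / 2
            best_id, best_dist = BACKGROUND, max_dist
            for part_id, loc in parts.items():
                if loc is None:
                    continue
                dist = math.hypot(loc[0] - cy, loc[1] - cx)
                if dist <= best_dist:
                    best_id, best_dist = part_id, dist
            labels[i][j] = best_id
    return labels


example = build_label_map({1: (8, 8), 2: (40, 40), 3: None},
                          grid_h=4, grid_w=4, max_dist=8.0)
for row in example:
    print(row)
```

Most units end up as background, which reflects the class imbalance that motivates sparse rather than dense prediction.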
Another practical problem here is determining the model depth and the input image size for training the FCN. Generally, layers at later stages carry more discriminative power and, therefore, are more likely to generate good localization results; however, their receptive fields are also much larger than those of earlier layers. For example, the receptive field of the inception-4a layer in BN-GoogleNet is too large relative to the input image to model a single object part. We propose a simple trick to deal with this problem, namely upsampling the input images so that the fixed-size receptive fields denoting object parts become relatively smaller compared to the whole object, while still using later-stage layers to guarantee discriminative power. In the proposed architecture, the input image is upsampled to double the resolution and the inception-4a layer is adopted to guarantee discrimination.
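The localization network is illustrated in Figure 5. The input images are warped and resized to a fixed size. All layers from the beginning to the inception-4a layer are taken from the BN-GoogleNet architecture. We then introduce a $1 \times 1$ convolutional layer, termed conv, with one output channel per object part plus one for the background, for classification. By adopting a location-preserving softmax that normalizes predictions at each spatial location of the feature map, the final loss function is a sum of softmax losses at all positions:
\[
L = -\sum_{h=1}^{H} \sum_{w=1}^{W} \log \sigma(h, w, \hat{c}), \qquad \sigma(h, w, \hat{c}) = \frac{\exp\big(f_{\mathrm{conv}}(h, w, \hat{c})\big)}{\sum_{c=0}^{M} \exp\big(f_{\mathrm{conv}}(h, w, c)\big)}.
\]
Here, $\hat{c} \in \{0, 1, \dots, M\}$ is the part label of the patch at location $(h, w)$, where the label $0$ denotes background, $M$ is the number of object parts, and $H \times W$ is the spatial size of the conv output. $f_{\mathrm{conv}}(h, w, c)$ stands for the output of the conv layer at spatial position $(h, w)$ and channel $c$.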
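The location-preserving softmax loss described above can be written out as a short sketch in plain Python. The nested-list layout (`scores[i][j]` holding one raw score per channel, with channel 0 as background) is an assumption made for illustration only.

```python
import math


def location_softmax_loss(scores, labels):
    """Sum of softmax cross-entropy losses over all spatial positions.

    scores[i][j]: list of raw scores, one per channel (background + parts).
    labels[i][j]: integer ground-truth channel index at that position.
    """
    total = 0.0
    for score_row, label_row in zip(scores, labels):
        for channel_scores, label in zip(score_row, label_row):
            peak = max(channel_scores)  # subtract the max for numerical stability
            log_norm = peak + math.log(
                sum(math.exp(s - peak) for s in channel_scores))
            # -log softmax(label) = log-normalizer - score of the true channel
            total += log_norm - channel_scores[label]
    return total


# 2 x 2 positions, 3 channels, all scores equal: each position costs log(3).
uniform = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in range(2)]
targets = [[0, 1], [2, 0]]
print(location_softmax_loss(uniform, targets))
```

Because the normalization is done independently at each position, a confident and correct prediction at one location contributes almost nothing to the loss regardless of what happens at other locations.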
Inference. Inference starts from the output of the learned FCN, i.e., the part-specific heat maps, to which we apply a Gaussian kernel to remove isolated noise. The final output of the localization network is a set of locations in the conv feature map, one per object part, each computed as the location with the maximum response for that part.
Meanwhile, considering that object parts may be missing in some images due to varied poses and occlusion, we set a threshold $\mu$: if the maximum response of a part is below $\mu$, we simply discard this part’s channel in the classification network for this image. Let $g(h, w, c) = \sigma(h, w, c) * G$ denote the softmax response $\sigma$ of part $c$ at position $(h, w)$ after convolution with a 2D Gaussian kernel $G$; the inferred part locations are then given as:
\[
(h_c^{*}, w_c^{*}) = \arg\max_{h, w}\, g(h, w, c) \quad \text{if } \max_{h, w}\, g(h, w, c) \ge \mu,
\]
and part $c$ is discarded otherwise.
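A minimal sketch of this inference step for a single part is given below. The kernel radius, the value of sigma, zero padding at the border, and the threshold used in the example are all assumptions chosen for illustration; the source does not state them here.

```python
import math


def gaussian_kernel(radius=1, sigma=1.0):
    """Normalized 2D Gaussian kernel of size (2 * radius + 1) squared."""
    weights = [[math.exp(-(dy * dy + dx * dx) / (2.0 * sigma * sigma))
                for dx in range(-radius, radius + 1)]
               for dy in range(-radius, radius + 1)]
    norm = sum(sum(row) for row in weights)
    return [[w / norm for w in row] for row in weights]


def smooth(heat, kernel):
    """Filter a 2D heat map with the kernel, using zero padding."""
    h, w = len(heat), len(heat[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    y, x = i + dy, j + dx
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[dy + r][dx + r] * heat[y][x]
            out[i][j] = acc
    return out


def locate_part(heat, threshold, kernel=None):
    """Return the (row, col) of the peak smoothed response, or None if weak."""
    smoothed = smooth(heat, kernel or gaussian_kernel())
    value, i, j = max((v, i, j)
                      for i, row in enumerate(smoothed)
                      for j, v in enumerate(row))
    return (i, j) if value >= threshold else None


# An isolated spike at (0, 4) and a coherent blob around (2, 2).
heat = [[0.0, 0.0, 0.0, 0.0, 0.9],
        [0.0, 0.0, 0.6, 0.0, 0.0],
        [0.0, 0.6, 0.8, 0.6, 0.0],
        [0.0, 0.0, 0.6, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0]]
print(locate_part(heat, threshold=0.3))
```

In this toy heat map the raw maximum is the isolated spike, but after smoothing the blob wins, which is the behavior the Gaussian kernel is meant to provide.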
3.2 Classification network
The second stage of the proposed DPS-CNN is a classification network with the inferred part locations given as an input. As shown in Figure 2, it follows a two-stream architecture with a Part Stream and an Object Stream to capture semantics from different angles. The outputs of both streams are fed into a feature fusion layer followed by a fully connected layer and a softmax layer.
Part stream. The part stream is the core of the proposed DPS-CNN architecture. To capture object part-dependent differences between fine-grained categories, one can train a set of part CNNs, each of which conducts classification on a part separately, as proposed by Zhang et al. Although such a method works well when only two object parts are employed, we argue that this approach is not applicable when the number of object parts is much larger, as in our case, because of the high time and space complexities.
We introduce two strategies to improve part stream efficiency, the first being model parameter sharing. Specifically, model parameters of layers before the part crop layer and inception-4e are shared among all object parts and can be regarded as a generic part-level feature extractor. This strategy reduces the number of parameters in the proposed architecture and thus reduces the risk of overfitting. We also introduce a part crop layer as a computational sharing strategy. The layer ensures that the feature extraction procedure of all parts only requires one pass through the convolutional layers.
After performing the shared feature extraction procedure, the computation of each object part is then partitioned through a part crop layer to model part-specific classification cues. As shown in Figure 2, the input for the part crop layer is a set of feature maps (the output of the inception-4a layer in our architecture) and the predicted part locations from the previous localization network, which also reside in the inception-4a feature maps. For each part, the part crop layer extracts a local neighborhood centered on the detected part location. Features outside the cropped region are simply discarded. In practice, we crop fixed-size neighborhood regions from the inception-4a feature maps. The crop size may have an impact on recognition performance, because larger crops result in redundancy when extracting multiple part features, while smaller crops cannot guarantee rich information. For simplicity, we use a single crop size in this paper, chosen so that the resulting receptive field is large enough to cover the entire part.
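The part crop operation can be sketched on a single-channel feature map as follows. The 3 × 3 crop size and the zero padding for crops that extend past the border are assumptions for illustration; the source does not specify either here.

```python
def crop_part(feature_map, center, size=3):
    """Extract a size x size neighborhood centered on a detected part.

    feature_map: H x W nested list (one channel of the shared feature maps).
    center: (row, col) part location in feature-map coordinates.
    Positions that fall outside the map are filled with zeros (assumed).
    """
    h, w = len(feature_map), len(feature_map[0])
    half = size // 2
    ci, cj = center
    return [[feature_map[i][j] if 0 <= i < h and 0 <= j < w else 0.0
             for j in range(cj - half, cj + half + 1)]
            for i in range(ci - half, ci + half + 1)]


# One shared feature map, cropped independently for two part locations.
shared = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
interior_crop = crop_part(shared, (1, 1))
corner_crop = crop_part(shared, (0, 0))
print(interior_crop)
print(corner_crop)
```

Because every part reads from the same `shared` map, the expensive feature extraction runs once and only the cheap cropping is repeated per part.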
Object stream. The object stream captures object-level semantics for fine-grained recognition. It follows the general architecture of BN-GoogleNet, in which the input of the network is an RGB image and the output of the inception-5b layer is a set of feature maps. Because the size of these feature maps differs from that in the original setting, we adjust the average pooling kernel accordingly instead of using the original one.
The design of the two-stream architecture in DPS-CNN is analogous to the well-known Deformable Part-based Models, in which object-level features are captured through a root filter at a coarser scale, while detailed part-level information is modeled by several part filters at a finer scale. We find it critical to measure visual cues from multiple semantic levels in an object recognition algorithm.
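We conduct the standard gradient descent to train the classification network. It should be noted, however, that the gradient of each element $X_{i,j}$ in the inception-4a feature maps is calculated by the following equation:
\[
\frac{\partial E}{\partial X_{i,j}} = \sum_{c=1}^{M} \phi\!\left(\frac{\partial E}{\partial X^{c}_{i',j'}}\right),
\]
where $E$ is the loss function, $X^{c}$ is the feature maps cropped by part $c$, $M$ is the number of parts, and
\[
\phi\!\left(\frac{\partial E}{\partial X^{c}_{i',j'}}\right) =
\begin{cases}
\dfrac{\partial E}{\partial X^{c}_{i',j'}} & \text{if } X_{i,j} \text{ corresponds to } X^{c}_{i',j'}, \\
0 & \text{otherwise.}
\end{cases}
\]
Specifically, the gradient of each cropped part feature map is projected back to the original spatial size of the inception-4a feature maps according to the respective part location, and the projected gradients are then summed.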
The computation of all other layers simply follows the standard gradient rules.
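The backward pass through the part crop layer amounts to a scatter-add: each cropped gradient is placed back at its part location and overlapping contributions are summed. A minimal single-channel sketch follows; the handling of discarded parts (`None` centers) and of crops that extend past the border is an assumption for illustration.

```python
def crop_backward(grad_crops, centers, height, width):
    """Accumulate per-part crop gradients onto the shared feature map.

    grad_crops: list of square nested lists, one gradient crop per part.
    centers: list of (row, col) part locations, or None for discarded parts.
    Returns a height x width nested list of summed gradients.
    """
    grad = [[0.0] * width for _ in range(height)]
    for crop, center in zip(grad_crops, centers):
        if center is None:  # part was discarded during localization
            continue
        half = len(crop) // 2
        for di, row in enumerate(crop):
            for dj, g in enumerate(row):
                i = center[0] - half + di
                j = center[1] - half + dj
                if 0 <= i < height and 0 <= j < width:
                    grad[i][j] += g  # overlapping parts add up
    return grad


ones = [[1.0] * 3 for _ in range(3)]
summed = crop_backward([ones, ones, ones], [(1, 1), (1, 2), None], 4, 4)
for row in summed:
    print(row)
```

Elements covered by several part crops receive the sum of their gradients, while elements covered by none receive zero gradient from the part stream.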
Note that the proposed DPS-CNN is implemented as a two-stage framework: after training the FCN, the weights of the localization network are fixed when training the classification network.
The most common method [44, 27] for combining all part-level and object-level features is to simply concatenate all these feature vectors, as illustrated in Figure 3(a). However, this approach may cause feature redundancy and also suffers from high dimensionality when the number of parts becomes large. To effectively utilize all part- and object-level features, we present three options for learning fusion features: scale sum (SS), scale max (SM), and scale mean-max (SMM), as illustrated in Figure 3(a), Figure 3(b), and Figure 3(d), respectively. All three methods share the step of placing a scale layer on top of each branch. As indicated by their names, the scale sum feature is the element-wise sum of all output branches, the scale max feature is generated by an element-wise maximum operation, and the scale mean-max feature is the concatenation of the element-wise mean and max features. In our previous work based on the standard CaffeNet architecture, each branch from the part stream and the object stream was connected to an independent fc6 layer to encourage feature diversity, and the final fusion feature was the sum of the outputs of all these fc6 layers. As this fusion process requires many times more model parameters than the original fc6 layer in CaffeNet and consequently incurs a huge memory cost, a convolutional layer was used for dimensionality reduction. Here we redesign this component for simplicity and to improve performance. First, a shared inception module is placed on top of the cropped part region to generate higher-level features. Also, a scale layer follows each branch feature to encourage diversity between parts. Furthermore, the scale layer has fewer parameters than the fully connected layer and, therefore, reduces the risk of overfitting and decreases the model storage requirements.
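The three fusion options can be sketched in a few lines of plain Python. Here each branch is a feature vector and the scale layer is modeled as a learnable element-wise multiplier per branch; that parameterization, and the function names, are assumptions for illustration rather than the exact layer used in the paper.

```python
def scale(branch, weights):
    """Scale layer: element-wise multiplication by learnable weights."""
    return [w * x for w, x in zip(weights, branch)]


def fuse(branches, scales, mode="smm"):
    """Fuse part and object branch features after per-branch scaling.

    mode "ss":  element-wise sum over branches (scale sum)
    mode "sm":  element-wise max over branches (scale max)
    mode "smm": concatenation of element-wise mean and max (scale mean-max)
    """
    scaled = [scale(b, s) for b, s in zip(branches, scales)]
    columns = list(zip(*scaled))  # one tuple per feature dimension
    total = [sum(col) for col in columns]
    peak = [max(col) for col in columns]
    if mode == "ss":
        return total
    if mode == "sm":
        return peak
    if mode == "smm":
        mean = [t / len(scaled) for t in total]
        return mean + peak  # list concatenation, twice the dimension
    raise ValueError("unknown fusion mode: " + mode)


branches = [[1.0, 2.0], [3.0, 0.0]]
identity = [[1.0, 1.0], [1.0, 1.0]]
print(fuse(branches, identity, "ss"))
print(fuse(branches, identity, "sm"))
print(fuse(branches, identity, "smm"))
```

Unlike plain concatenation, the output dimension of all three options is independent of the number of branches, which is what keeps the fused feature compact as more parts are added.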
In this section we present experimental results and a thorough analysis of the proposed method. Specifically, we evaluate the performance from four different aspects: localization accuracy, classification accuracy, inference efficiency, and model interpretation.
4.1 Dataset and implementation details
Experiments are conducted on the widely used fine-grained classification benchmark, the Caltech-UCSD Birds dataset (CUB-200-2011). The dataset contains 200 bird categories with roughly 30 training images per category. In the training phase we adopt the strong supervision available in the dataset; that is, we employ 2D key point part annotations of altogether 15 object parts together with image-level labels and object bounding boxes.
The labeled parts (back, beak, belly, breast, crown, forehead, left eye, left leg, left wing, nape, right eye, right leg, right wing, tail, and throat) indicate places where people usually focus when asked to classify fine-grained categories; thus they provide valuable information for generating human-understandable systems.
The proposed Deeper Part-Stacked CNN architecture is implemented using the open-source package Caffe. Specifically, input images are warped to a fixed size, randomly cropped, and then fed into the localization network and the part stream in the classification network as input. We employ a pooling layer to guarantee synchronization between the two streams in the classification network.
4.2 Candidate keypoints
For the key point localization task, we follow the proposal-based object detection pipeline; centers of receptive fields corresponding to a certain layer are first regarded as candidate points and then forwarded to a fully convolutional network for further classification. As in object detection using proposals, whether the selected candidate points have good coverage of the pixels of interest in the test image plays a crucial role in key point localization, since missed key points cannot be recovered in subsequent classification. Thus, we first evaluate the candidate point sampling method. The evaluation is based on the PCK metric, in which the error tolerance is normalized with respect to the input image size. For consistency with the evaluation of key point localization, a ground truth point is recalled if there exists a candidate point matched in terms of the PCK metric. Table I shows the localization recall of candidate points selected by inception-4a for different values of the PCK tolerance. As expected, candidate points sampled by the inception-4a layer have good coverage of the ground truth under a loose tolerance. However, the recall drops dramatically under the strictest tolerance. This is mainly because of the large stride (16) of the inception-4a layer, which places neighboring candidate points 16 pixels apart, whereas the strictest tolerance requires a candidate point to lie within a much smaller distance of the ground truth.
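The effect of the stride on candidate recall can be reproduced with a small sketch. The 64 × 64 image, the ground-truth points, and the two tolerance values below are hypothetical and chosen only to make the arithmetic easy to follow; the half-stride offset of the candidate grid is likewise an assumption.

```python
import math


def grid_candidates(height, width, stride=16):
    """Candidate points on a regular grid, assumed to start at stride / 2."""
    offset = stride // 2
    return [(y, x)
            for y in range(offset, height, stride)
            for x in range(offset, width, stride)]


def candidate_recall(ground_truth, candidates, alpha, height, width):
    """Fraction of ground-truth points with a candidate within tolerance.

    The tolerance is alpha times the larger image side, as in PCK.
    """
    tolerance = alpha * max(height, width)
    hits = sum(
        1 for gy, gx in ground_truth
        if any(math.hypot(gy - cy, gx - cx) <= tolerance
               for cy, cx in candidates))
    return hits / len(ground_truth)


size = 64
candidates = grid_candidates(size, size)
truth = [(8, 8), (16, 16), (30, 41)]  # hypothetical key points
loose = candidate_recall(truth, candidates, 0.2, size, size)
strict = candidate_recall(truth, candidates, 0.05, size, size)
print(loose, strict)
```

With a stride of 16, a ground-truth point can be up to about 11.3 pixels (half a stride along each axis) from its nearest candidate, so any tolerance below that distance necessarily leaves some key points unrecalled.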
4.3 Localization Results
Following the PCK protocol, we consider a key point to be correctly predicted if the prediction lies within a Euclidean distance of $\alpha$ times the maximum of the input width and height from the ground truth. Localization results are reported for multiple values of $\alpha$ in the analysis below; $\alpha$ measures the error tolerance in key point localization. To investigate the effect of the selected layer on key point localization, we perform experiments using the inception-4a, inception-4b, inception-4c, and inception-4d layers as part detector layers. As shown in Table III, a higher layer with a larger receptive field tends to achieve better localization performance than a lower layer at a loose tolerance. This is mainly because larger receptive fields are crucial for capturing spatial relationships between parts and thus improve performance (see Table II). In contrast, at the stricter tolerances, the performance decreases at deeper layers. One possible explanation is that although higher layers obtain better semantic information about the object, they lose more detailed spatial information. To evaluate the effectiveness of our key point localization approach, we also compare it with recently published works [30, 31, 45] that provide PCK evaluation results on CUB-200-2011, along with experimental results using a more reasonable evaluation metric called average precision of key points (APK), which correctly penalizes both missed and false-positive detections. As can be seen from Table III, our method outperforms existing techniques under various settings in terms of PCK. The most striking result is that our approach outperforms the compared methods by large margins when a small tolerance is used.
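For reference, the PCK criterion as stated above can be computed with a short sketch. Treating unannotated parts as skipped and missing predictions as errors is an assumption about the evaluation protocol, and the coordinates in the example are hypothetical.

```python
import math


def pck(predictions, ground_truth, alpha, height, width):
    """Percentage of correct key points at tolerance alpha.

    predictions, ground_truth: lists of (y, x) tuples or None.
    A prediction is correct if it lies within alpha * max(height, width)
    pixels of the annotated location.
    """
    tolerance = alpha * max(height, width)
    correct = 0
    counted = 0
    for pred, truth in zip(predictions, ground_truth):
        if truth is None:
            continue  # assumed: parts without annotation are not scored
        counted += 1
        if pred is not None and math.hypot(pred[0] - truth[0],
                                           pred[1] - truth[1]) <= tolerance:
            correct += 1
    return correct / counted if counted else 0.0


preds = [(10, 10), (50, 50), None, (5, 5)]
truth = [(12, 10), (80, 50), (5, 5), None]
print(pck(preds, truth, alpha=0.1, height=100, width=100))
```

Note that PCK as sketched here does not penalize a prediction for a part that is not annotated, which is one reason a metric such as APK, which accounts for false positives, is reported alongside it.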
The part localization architecture adopted in DPS-CNN achieves the highest average PCK@0.1 on the CUB-200-2011 test set over the object parts. Specifically, the employed Gaussian smoothing kernel delivers improvements over methods that use standard convolutional layers in BN-GoogleNet.
Another interesting phenomenon is that parts residing near the birds’ heads tend to be located more accurately. It turns out that a bird’s head has a relatively stable structure with fewer deformations and a lower probability of occlusion. In contrast, parts that are highly deformable, such as the wings and legs, get lower PCK values. Figure 6 shows typical localization results using the proposed method.
4.4 Classification results
We begin our classification analysis by studying the discriminative power of each object part. We select one object part at a time as the input and discard the computation of all other parts. As shown in Table IV, different parts produce significantly different classification results. The most discriminative part, “Throat”, achieves quite impressive accuracy, while the lowest accuracy is obtained for the part “Tail”. Therefore, to improve classification, it may be beneficial to find a rational combination or order of object parts instead of directly running the experiment on all parts together. More interestingly, when comparing the results in Table III and Table IV, it can be seen that parts located more accurately, such as Throat, Nape, Forehead, and Beak, tend to achieve better performance in the recognition task, while some parts with poor localization accuracy, like Tail and Left Leg, perform worse. This observation may support the hypothesis that a more discriminative part is easier to locate in the context of fine-grained categorization and vice versa.
To evaluate our framework’s overall performance, we first train a baseline model with accuracy using a BN-Inception architecture with pre-training on ImageNet. By stacking certain part features and applying our proposed fusion method, our framework improves the performance to . To evaluate our proposed feature fusion method, we then train four DPS-CNN models with the same experimental settings (maximum iteration and learning rate) but different feature fusion methods. The results in Table V (Rows 2-5) demonstrate that SMM fusion achieves the best performance and outperforms the FC method by 1.69 percentage points (83.55% vs. 81.86%).
To investigate which parts should be selected in our learning framework, we conduct the following experiments guided by two principles: one concerns feature discrimination and the other feature diversity. Here we consider parts with higher accuracy in Table IV to be more discriminative, and combinations of parts at distant locations to be more diverse. We first select the top 6 parts with the highest accuracy in Table IV by applying only the discrimination principle, then choose 3, 5, 9 and 15 parts respectively by taking both principles into account. The experimental results are shown in Table V (Rows 6-10). We observe that increasing the number of parts generally brings a slight improvement, and that all of these settings perform better than the one with the 6 most discriminative parts. This is mainly because most of those 6 parts are adjacent to each other, so they fail to produce diverse features in our framework. It should also be noted that using the features of all parts does not guarantee the best performance; on the contrary, it results in poorer accuracy (84.45% with 15 parts versus 85.12% with 9 parts). This finding shows that the feature redundancy caused by appending an excessive number of parts during learning may degrade accuracy, and suggests that an appropriate strategy for integrating multiple parts is critical.
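The text does not say how the discrimination and diversity principles were combined in practice. The sketch below is one hypothetical way to operationalize them: a greedy rule that starts from the most accurate part and then repeatedly adds the part with the best sum of single-part accuracy and distance to the nearest already-selected part. The function name `select_parts`, the `diversity_weight` parameter and the toy accuracies and locations are all assumptions for illustration, not values or procedures from the paper.

```python
import numpy as np

def select_parts(accuracy, location, k, diversity_weight=1.0):
    """Greedily pick k parts, trading off accuracy against spatial spread.

    accuracy : (P,) single-part classification accuracy (discrimination)
    location : (P, 2) typical normalized (x, y) position of each part
    """
    accuracy = np.asarray(accuracy, dtype=float)
    location = np.asarray(location, dtype=float)

    chosen = [int(np.argmax(accuracy))]  # start with the most discriminative part
    while len(chosen) < k:
        # Distance from every part to its nearest already-chosen part (diversity).
        diffs = location[:, None, :] - location[chosen][None, :, :]
        nearest = np.linalg.norm(diffs, axis=-1).min(axis=1)
        gain = accuracy + diversity_weight * nearest
        gain[chosen] = -np.inf           # never pick the same part twice
        chosen.append(int(np.argmax(gain)))
    return chosen

# Toy data: three accurate but adjacent head parts and one distant, weaker part.
acc = [0.60, 0.58, 0.57, 0.30]
loc = [[0.20, 0.20], [0.22, 0.20], [0.21, 0.22], [0.90, 0.80]]

print(select_parts(acc, loc, k=2))                        # [0, 3]
print(select_parts(acc, loc, k=2, diversity_weight=0.0))  # [0, 1]
```

With the diversity term switched off, the rule reduces to picking the top-accuracy parts, which in this toy case are adjacent to each other, mirroring the situation described for the 6 most discriminative parts.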
|2||5-parts + FC||81.86|
|3||5-parts + SS||83.06|
|4||5-parts + SM||83.41|
|5||5-parts + SMM||83.55|
|6||6-parts + SMM||84.12|
|7||3-parts + SMM||84.29|
|8||5-parts + SMM||84.91|
|9||9-parts + SMM||85.12|
|10||15-parts + SMM||84.45|
Only FPS (frames per second) results that have been reported by the authors are shown in this table.
|Method||Training||Testing||Pre-trained Model||FPS||Acc (%)|
|Part-Stacked CNN ||✓||✓||✓||AlexNet||20||76.62|
|Deep LAC ||✓||✓||✓||AlexNet||-||80.26|
|Part R-CNN ||✓||✓||✓||AlexNet||-||76.37|
Part R-CNN  without BBox
|PoseNorm CNN ||✓||✓||AlexNet||-||75.70|
|Bilinear-CNN (M+D+BBox) ||✓||✓||VGG16+VGGM||8||85.10|
|Bilinear-CNN (M+D) ||VGG16+VGGM||8||84.10|
|Spatial Transformer CNN ||Inception+BN||-||84.10|
|DPS-CNN with 9 parts||✓||Inception+BN||32||85.12|
|DPS-CNN ensemble with 4 models||✓||Inception+BN||8||86.56|
We also present the performance comparison between DPS-CNN and existing fine-grained recognition methods. As can be seen in Table VI, our approach, using only keypoint annotations during training, achieves 85.12% accuracy, which is comparable with the state-of-the-art method  that achieves 85.10% using bounding boxes in both training and testing. Moreover, it is interpretable and faster: the entire forward pass of DPS-CNN runs at 32 FPS (NVIDIA TitanX), while B-CNN[D,M] runs at 8 FPS (NVIDIA K40; note that the computational power of the TitanX is around 1.5 times that of the K40). In particular, our method is much faster than proposal-based methods such as  and , which require multiple network forward propagations for proposal evaluation, whereas part detection and feature extraction are accomplished efficiently in a single forward pass in our approach. In addition, we combine four models that integrate different parts (listed in Table V, Rows 7-10) into an ensemble, which achieves 86.56% accuracy on CUB-200-2011.
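Because the two speeds were measured on different GPUs, a rough hardware-adjusted comparison can be worked out from the FPS values in Table VI and the stated factor of about 1.5 between the TitanX and the K40. The sketch assumes that throughput scales linearly with computational power, which is a simplification introduced here and not a claim made in the paper; the function name is likewise hypothetical.

```python
def normalized_speedup(fps_a, fps_b, power_a_over_b):
    """Speed-up of method A over method B after scaling B's throughput
    to A's hardware, assuming throughput is proportional to compute power."""
    return fps_a / (fps_b * power_a_over_b)

dps_cnn_fps = 32    # DPS-CNN with 9 parts on TitanX (Table VI)
b_cnn_fps = 8       # Bilinear-CNN (M+D) on K40 (Table VI)
titanx_vs_k40 = 1.5

raw = dps_cnn_fps / b_cnn_fps
adjusted = normalized_speedup(dps_cnn_fps, b_cnn_fps, titanx_vs_k40)
print(raw)                  # 4.0
print(round(adjusted, 2))   # 2.67
```

Under this assumption, DPS-CNN would remain roughly 2.7 times faster after accounting for the hardware difference.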
To understand what features are learned in DPS-CNN, we take the aforementioned five-part model and compare its feature-map visualizations with those of a BN-Inception model fine-tuned on CUB-200-2011. Specifically, we pick the six top-scoring feature maps of the Inception-4a layer for visualization, where the score is the sum over each feature map. As shown in Figure 7, each example image from the test set is followed by three rows of feature maps, which are selected, from top to bottom, from the part stream, the object stream and the BN-Inception baseline network respectively. Interestingly, our part stream has learned feature maps that appear more intuitive than those learned by the other two. Specifically, it yields more focused and cleaner patterns which tend to be highly activated by the network. Moreover, the object stream and the baseline network are more likely to activate filters with extremely high-frequency details, at the expense of extra noise, while the part stream tends to obtain a mixture of low- and mid-frequency information. The red dashed box in Figure 7 indicates a failure example, in which both our part stream and object stream fail to learn useful features. This may be because our part localization network fails to locate the Crown and Left Leg parts: the branch in this image looks similar to bird legs, and another, occluded bird also interferes with locating the Crown part.
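The selection rule used for the visualization (score each feature map by the sum of its activations, keep the six highest) can be sketched as follows. The array layout `(channels, height, width)` and the function name are assumptions for illustration; extracting the Inception-4a activations from a trained network is outside the scope of the sketch.

```python
import numpy as np

def top_k_feature_maps(activations, k=6):
    """Return the indices and scores of the k feature maps with the largest
    summed activation.

    activations : (C, H, W) output of one layer for a single image
    """
    activations = np.asarray(activations, dtype=float)
    scores = activations.sum(axis=(1, 2))            # one score per channel
    order = np.argsort(-scores, kind="stable")[:k]   # descending by score
    return order, scores[order]

# Toy layer output: 8 channels of size 4 x 4, channel c filled with value c.
layer_out = np.arange(8).reshape(8, 1, 1) * np.ones((8, 4, 4))
idx, top_scores = top_k_feature_maps(layer_out, k=6)
print(idx.tolist())  # [7, 6, 5, 4, 3, 2]
```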
4.5 Model interpretation
One of the most prominent features of the DPS-CNN method is that it can produce human-understandable interpretation manuals for fine-grained recognition. Here we directly borrow the idea from  for interpretation using the proposed method.
Unlike  who directly conducted one-on-one classification on object parts, the proposed method performs interpretation relatively indirectly. Since using each object part alone does not produce convincing classification results, we perform the interpretation analysis on a combination of bounding box supervision and each single object part. The analysis is performed in two ways: a “one-versus-rest” comparison that identifies the most discriminative part for separating a subcategory from all other classes, and a “one-versus-one” comparison that yields the criteria for distinguishing a subcategory from its most similar classes.
The “one-versus-rest” manual for an object category . For every part , we compute the summation of prediction scores of the category’s positive samples. The most discriminative part is then captured as the one with the largest accumulated score:
The “one-versus-one” manual obtained by computing as the part which results in the largest difference of prediction scores on two categories and . We first take the respective two rows in the score matrix , and re-normalize it using the binary classification criterion as . Afterwards, the most discriminative part is given as:
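The two selection rules can be sketched in code under assumed notation. Here `scores[p, n, c]` is taken to be the prediction score of category `c` for sample `n` when part `p` is used, and `labels[n]` is the true category; this layout, the function names, and in particular the two-class re-normalization (dividing the two categories' scores by their sum) are assumptions for illustration and may differ from the paper's exact criterion.

```python
import numpy as np

def one_vs_rest_part(scores, labels, category):
    """Part with the largest accumulated score of `category` over the
    category's positive samples."""
    positives = labels == category
    accumulated = scores[:, positives, category].sum(axis=1)   # (P,)
    return int(np.argmax(accumulated))

def one_vs_one_part(scores, labels, cat_a, cat_b):
    """Part giving the largest score difference between two categories."""
    pair = scores[:, :, [cat_a, cat_b]]                 # keep only the two classes
    pair = pair / pair.sum(axis=-1, keepdims=True)      # assumed binary re-normalization
    diff = pair[..., 0] - pair[..., 1]                  # (P, N), positive favors cat_a
    is_a, is_b = labels == cat_a, labels == cat_b
    margin = diff[:, is_a].sum(axis=1) - diff[:, is_b].sum(axis=1)
    return int(np.argmax(margin))

# Toy data: 2 parts, 4 samples (two per category), 3 categories.
part0 = [[0.50, 0.05, 0.45], [0.50, 0.05, 0.45], [0.05, 0.50, 0.45], [0.05, 0.50, 0.45]]
part1 = [[0.60, 0.35, 0.05], [0.60, 0.35, 0.05], [0.35, 0.60, 0.05], [0.35, 0.60, 0.05]]
scores = np.array([part0, part1])
labels = np.array([0, 0, 1, 1])

print(one_vs_rest_part(scores, labels, 0))    # 1
print(one_vs_one_part(scores, labels, 0, 1))  # 0
```

In the toy data, part 1 gives category 0 the highest raw score, so it wins the one-versus-rest comparison, whereas part 0 separates categories 0 and 1 more sharply once only those two are considered, so it wins the one-versus-one comparison.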
The model interpretation routine is demonstrated in Figure 8. When a test image is presented, the proposed method first conducts object classification using the DPS-CNN architecture. The predicted category is presented as a set of images in the dataset that are closest to the test image according to the feature vector of each part. In addition to the classification results, the proposed method also presents classification criteria that distinguish the predicted category from its most similar neighboring classes based on object parts. Again we use part features, this time extracted after part cropping, to retrieve the nearest-neighbor part patches of the input test image. The procedure described above provides an intuitive visual guide for distinguishing fine-grained categories.
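The retrieval step can be illustrated with a minimal nearest-neighbor lookup over part feature vectors. The Euclidean distance, the function name and the toy two-dimensional features are assumptions for illustration; the text does not specify which distance is used.

```python
import numpy as np

def nearest_neighbors(query, gallery, k=3):
    """Indices and distances of the k gallery features closest to the query.

    query   : (D,) part feature of the test image
    gallery : (M, D) features of the same part for the dataset images
    """
    query = np.asarray(query, dtype=float)
    gallery = np.asarray(gallery, dtype=float)
    dist = np.linalg.norm(gallery - query, axis=1)   # Euclidean distance (assumed)
    order = np.argsort(dist, kind="stable")[:k]
    return order, dist[order]

# Toy gallery of four 2-D part features.
gallery = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
idx, dist = nearest_neighbors([0.9, 0.1], gallery, k=2)
print(idx.tolist())  # [1, 0]
```

Running the same lookup once per part yields, for each part, the dataset patches that would be shown next to the test image.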
In this paper, we propose a novel fine-grained recognition method called Deeper Part-Stacked CNN (DPS-CNN). The method exploits detailed part-level supervision: object parts are first located by a localization network and then passed to a two-stream classification network that explicitly captures object- and part-level information. We also present a new feature vector fusion strategy that effectively combines both part and object stream features. Experiments on CUB-200-2011 demonstrate the effectiveness and efficiency of our system. We also present human-understandable interpretations of the proposed method, which can be used as a visual field guide for studying fine-grained categorization.
DPS-CNN can be applied to fine-grained visual categorization with strong supervision and can be easily generalized to various applications including:
- Discarding the requirement for strong supervision. Instead of introducing manually labeled part annotations to generate human-understandable visual guides, one can also exploit unsupervised part discovery methods  to define object parts automatically, which requires far less human labelling effort.
- Attribute learning. The application of DPS-CNN is not restricted to FGVC. For instance, online shopping  performance could benefit from clothing attribute analysis from local parts provided by DPS-CNN.
- Context-based CNN. The role of local “parts” in DPS-CNN is interchangeable with global contexts, especially for objects that are small and have no obvious object parts such as volleyballs or tennis balls.
-  P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, “Caltech-ucsd birds 200,” 2010.
-  C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” 2011.
-  T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur, “Birdsnap: Large-scale fine-grained visual categorization of birds,” in Computer Vision and Pattern Recognition (CVPR), 2014. IEEE, 2014, pp. 2019–2026.
-  A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li, “Novel dataset for fine-grained image categorization: Stanford dogs,” in Proc. CVPR Workshop on Fine-Grained Visual Categorization (FGVC), 2011.
-  O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3498–3505.
-  M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Computer Vision, Graphics & Image Processing, 2008. ICVGIP’08. Sixth Indian Conference on. IEEE, 2008, pp. 722–729.
-  A. Angelova, S. Zhu, and Y. Lin, “Image segmentation for large-scale subcategory flower recognition,” in Applications of Computer Vision (WACV), 2013 IEEE Workshop on. IEEE, 2013, pp. 39–45.
-  M. Stark, J. Krause, B. Pepik, D. Meger, J. J. Little, B. Schiele, and D. Koller, “Fine-grained categorization for 3d scene understanding,” International Journal of Robotics Research, vol. 30, no. 13, pp. 1543–1552, 2011.
-  S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013.
-  C. Wah, S. Branson, P. Perona, and S. Belongie, “Multiclass recognition and part localization with humans in the loop,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 2524–2531.
-  A. Vedaldi, S. Mahendran, S. Tsogkas, S. Maji, R. Girshick, J. Kannala, E. Rahtu, I. Kokkinos, M. B. Blaschko, D. Weiss et al., “Understanding objects in detail with fine-grained attributes,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 3622–3629.
-  J. Krause, H. Jin, J. Yang, and L. Fei-Fei, “Fine-grained recognition without part annotations,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5546–5555.
-  Z. Xu, S. Huang, Y. Zhang, and D. Tao, “Augmenting strong supervision using web data for fine-grained categorization,” in Computer Vision (ICCV), 2015 IEEE International Conference on, 2015.
-  J. Deng, J. Krause, and L. Fei-Fei, “Fine-grained crowdsourcing for fine-grained recognition,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 580–587.
-  Y. Chai, V. Lempitsky, and A. Zisserman, “Symbiotic segmentation and part localization for fine-grained categorization,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 321–328.
-  S. Branson, G. Van Horn, S. Belongie, and P. Perona, “Bird species categorization using pose normalized deep convolutional nets,” arXiv preprint arXiv:1406.2952, 2014.
-  T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1449–1457.
-  D. Wang, Z. Shen, J. Shao, W. Zhang, X. Xue, and Z. Zhang, “Multiple granularity descriptors for fine-grained categorization,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2399–2406.
-  T. Berg and P. Belhumeur, “Poof: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 955–962.
-  T. Berg and P. N. Belhumeur, “How do you tell a blackbird from a crow?” in Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013, pp. 9–16.
-  N. Kumar, P. N. Belhumeur, A. Biswas, D. W. Jacobs, W. J. Kress, I. C. Lopez, and J. V. Soares, “Leafsnap: A computer vision system for automatic plant species identification,” in Computer Vision–ECCV 2012. Springer, 2012, pp. 502–516.
-  S. Branson, G. Van Horn, C. Wah, P. Perona, and S. Belongie, “The ignorant led by the blind: A hybrid human–machine vision system for fine-grained categorization,” International Journal of Computer Vision, vol. 108, no. 1-2, pp. 3–29, 2014.
-  G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie, “Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 595–604.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  E. Rosch, C. B. Mervis, W. D. Gray, D. M. Johnson, and P. Boyes-Braem, “Basic objects in natural categories,” Cognitive psychology, vol. 8, no. 3, pp. 382–439, 1976.
-  S. Maji and G. Shakhnarovich, “Part and attribute discovery from relative annotations,” International Journal of Computer Vision, vol. 108, no. 1-2, pp. 82–96, 2014.
-  N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns for fine-grained category detection,” in Computer Vision–ECCV 2014. Springer, 2014, pp. 834–849.
-  X. Zhang, H. Xiong, W. Zhou, and Q. Tian, “Fused one-vs-all mid-level features for fine-grained visual categorization,” in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 287–296.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 448–456.
-  S. Huang, Z. Xu, D. Tao, and Y. Zhang, “Part-stacked cnn for fine-grained visual categorization,” in Proceedings of the IEEE International Conference on Computer Vision, 2016.
-  N. Zhang, E. Shelhamer, Y. Gao, and T. Darrell, “Fine-grained pose prediction, normalization, and recognition,” arXiv preprint arXiv:1511.07063, 2015.
-  H. Zhang, T. Xu, M. Elhoseiny, X. Huang, S. Zhang, A. Elgammal, and D. Metaxas, “Spda-cnn: Unifying semantic part detection and abstraction for fine-grained recognition.”
-  X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian, “Picking deep filter responses for fine-grained image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1134–1142.
-  N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev, “Panda: Pose aligned networks for deep attribute modeling,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 1637–1644.
-  G. Gkioxari, R. Girshick, and J. Malik, “Actions and attributes from wholes and parts,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2470–2478.
-  J. Zhu, X. Chen, and A. L. Yuille, “Deepm: A deep part-based model for object detection and semantic part localization,” arXiv preprint arXiv:1511.07131, 2015.
-  S. Milborrow and F. Nicolls, “Locating facial features with an extended active shape model,” in European conference on computer vision. Springer, 2008, pp. 504–513.
-  T. F. Cootes, G. J. Edwards, C. J. Taylor et al., “Active appearance models,” IEEE Transactions on pattern analysis and machine intelligence, vol. 23, no. 6, pp. 681–685, 2001.
-  I. Matthews and S. Baker, “Active appearance models revisited,” International Journal of Computer Vision, vol. 60, no. 2, pp. 135–164, 2004.
-  J. M. Saragih, S. Lucey, and J. F. Cohn, “Face alignment through subspace constrained mean-shifts,” in 2009 IEEE 12th International Conference on Computer Vision. IEEE, 2009, pp. 1034–1041.
-  J. Liu and P. N. Belhumeur, “Bird part localization using exemplar-based models with enforced pose and subcategory consistency,” in Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013, pp. 2520–2527.
-  J. Liu, Y. Li, and P. N. Belhumeur, “Part-pair representation for part localization,” in Computer Vision–ECCV 2014. Springer, 2014, pp. 456–471.
-  K. J. Shih, A. Mallya, S. Singh, and D. Hoiem, “Part localization using multi-proposal consensus for fine-grained categorization,” Proceedings of The British Machine Vision Conference (BMVC), 2015.
-  D. Lin, X. Shen, C. Lu, and J. Jia, “Deep lac: Deep localization, alignment and classification for fine-grained recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1666–1674.
-  X. Yu, F. Zhou, and M. Chandraker, “Deep deformation network for object landmark localization,” arXiv preprint arXiv:1605.01014, 2016.
-  R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
-  S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” 2016.
-  L. Bo, X. Ren, and D. Fox, “Kernel descriptors for visual recognition,” in Advances in neural information processing systems, 2010, pp. 244–252.
-  J. Sánchez, F. Perronnin, and Z. Akata, “Fisher vectors for fine-grained visual categorization,” in FGVC Workshop in IEEE Computer Vision and Pattern Recognition (CVPR), 2011.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv preprint arXiv:1409.4842, 2014.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  E. Gavves, B. Fernando, C. G. Snoek, A. W. Smeulders, and T. Tuytelaars, “Local alignments for fine-grained categorization,” International Journal of Computer Vision, vol. 111, no. 2, pp. 191–212, 2015.
-  S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie, “Visual recognition with humans in the loop,” in Computer Vision–ECCV 2010. Springer, 2010, pp. 438–451.
-  C. Wah, G. Van Horn, S. Branson, S. Maji, P. Perona, and S. Belongie, “Similarity comparisons for interactive fine-grained categorization,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 859–866.
-  M. Simon and E. Rodner, “Neural activation constellations: Unsupervised part model discovery with convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1143–1151.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” arXiv preprint arXiv:1506.01497, 2015.
-  J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International journal of computer vision, vol. 104, no. 2, pp. 154–171, 2013.
-  P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 32, no. 9, pp. 1627–1645, 2010.
-  Y. Yang and D. Ramanan, “Articulated human detection with flexible mixtures of parts,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 12, pp. 2878–2890, 2013.
-  J. L. Long, N. Zhang, and T. Darrell, “Do convnets learn correspondence?” in Advances in Neural Information Processing Systems, 2014, pp. 1601–1609.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
-  M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
-  T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang, “The application of two-level attention models in deep convolutional neural network for fine-grained image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 842–850.
-  K. M. Hadi, H. Xufeng, L. Svetlana, B. Alexander, and B. Tamara, “Where to buy it: Matching street clothing photos in online shops,” in Computer Vision (ICCV), 2015 IEEE International Conference on, 2015.