Humans heavily rely on subtle local visual cues to distinguish fine-grained object categories. For example, in Figure 1(a), human experts differentiate a summer tanager from a scarlet tanager by the colors of the wing and tail. To build human-level fine-grained recognition AI systems, it is therefore essential to locate discriminative object parts and learn local visual representations for these parts.
State-of-the-art fine-grained recognition methods bd11 ; bd13 either rely on manually labeled parts to train part detectors in a fully supervised manner, or employ reinforcement learning or spatial-transformer-based attention models bd3 ; nips1 to locate object parts using only object category annotations in a weakly supervised manner. However, both types of methods have major practical limitations. Fully supervised methods need a time-consuming, error-prone manual part-labeling process, while object labels alone are generally too weak a supervision signal to reliably locate multiple discriminative object parts. For example, in Figure 1(b), the method nips1 fails to locate the tail as an attention region.
Humans have a remarkable ability to learn to locate object parts from multiple sources of information. Aside from strongly labeled part locations and weakly labeled object categories, part descriptions, such as “red wing”, also play an important role in developing the ability to locate object parts. At an early age, children learn to recognize and locate object parts by reading or listening to part descriptions. Part descriptions do not require time-consuming manual part-location labeling, yet they provide much stronger supervision than object category labels. We call such descriptions part attributes.
Inspired by this capability, we propose a part attribute-guided attention localization scheme for fine-grained recognition. Using part attributes as a weak supervision signal, reinforcement learning can learn part-specific optimal localization strategies given the same image as the environment state. Based on this intuition, it is reasonable to expect that distinctive part localizers can be learned as strategies for looking for and describing the appearances of different parts. In the proposed scheme, multiple fully convolutional attention localization networks nips1 are trained, each predicting the attribute values of one part. We design a novel reward strategy for learning the part localizers and part attribute predictors.
Part attribute-guided attention localization networks can more accurately locate object parts (Figure 1(c)). More importantly, using the part locations and appearance features from these networks leads to state-of-the-art performance on fine-grained recognition, as demonstrated on the CUB-200-2011 dataset bd6 . Moreover, part attributes can be acquired at scale via either human labeling or data mining techniques. Attributes have been successfully employed for image recognition parikh2011relative ; akata2013label ; hwang2014unified ; huang2015cross and image generation yan2015attribute2image . To the best of our knowledge, no prior work has utilized part attributes to guide the learning of visual attention.
2 Attribute-Guided Attention Localization
The architecture of the proposed scheme is shown in Figure 2. Attribute descriptions are only used for learning part localizers during training. They are not used in the testing stage.
In the training stage, a fully-convolutional attention localization network nips1 is learned for each part. In contrast with nips1 , the task of the fully-convolutional attention localization network is to learn where to look for better part attribute prediction rather than predicting the category of the entire object.
For testing, we extract features from both part regions located by the network and the entire image and concatenate them as a joint representation. The joint representation is utilized to predict the image category.
2.1 Problem Formulation
Given training images $\{I_i\}_{i=1}^{N}$ and their object labels $\{y_i\}_{i=1}^{N}$, our goal is to train a model that classifies each image $I_i$ as its ground-truth label $y_i$.
Fine-grained objects have $K$ parts. The localizing-by-describing problem seeks a policy that locates the $K$ parts and classifies images based on these part locations and their features. The objective can be formulated as:

$$\min_{g,\, c_1, \dots, c_K} \sum_{i=1}^{N} \mathcal{L}\big(g(I_i, c_1(I_i), \dots, c_K(I_i)),\, y_i\big),$$

where $c_k(\cdot)$ is a function that finds the location of part $k$ and crops its image region; the crop size of each part is manually defined. $g(\cdot)$ is the classification function, and $\mathcal{L}$ is the cross-entropy loss function measuring the quality of the classification.
Precisely predicting part locations using only the image labels is very challenging. In localizing by describing, we localize object parts with the help of visual attribute descriptions. A visual attribute is a binary annotation whose value correlates with an aspect of the object's appearance, and a local attribute is generally related to the appearance of an object part. The attribute description of the $k$-th part of the $i$-th image is denoted as $a_i^k$. Each part description is a binary vector $a_i^k \in \{0, 1\}^{m_k}$, where each element indicates whether the corresponding attribute exists in the $i$-th image, and $m_k$ is the number of attributes for the $k$-th part. We aim to learn better part localizers from part attribute annotations. An auxiliary part attribute classification loss is proposed to facilitate the learning of part localizers:

$$\mathcal{L}_{attr} = \sum_{i=1}^{N} \sum_{k=1}^{K} \mathcal{L}_{a}\big(h_k(I_i, c_k(I_i)),\, a_i^k\big),$$

where $h_k(\cdot)$ is a multi-label attribute prediction function of the $k$-th part, and each element of $h_k(I_i, c_k(I_i))$ indicates the predicted probability of the corresponding attribute of part $k$. $\mathcal{L}_{a}$ is a multi-label cross-entropy loss:

$$\mathcal{L}_{a}(\hat{a}, a) = -\sum_{j=1}^{m_k} \big[a_j \log \hat{a}_j + (1 - a_j) \log(1 - \hat{a}_j)\big].$$
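As a concrete sketch of this multi-label cross-entropy, here is a pure-Python version (the function name and the `eps` clamp are our additions for illustration):

```python
import math

def multilabel_bce(pred, target, eps=1e-7):
    """Multi-label cross-entropy: the sum of per-attribute binary
    cross-entropy terms between predicted probabilities `pred` and
    binary attribute labels `target`."""
    loss = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical stability
        loss += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return loss
```

A perfect prediction yields a loss near zero, while predicting 0.5 for every attribute yields `m * log 2` for `m` attributes.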
2.2 Training of Localizers
Given images and attribute descriptions, we jointly learn a part localizer $c_k$ and a multi-label attribute predictor $h_k$ for each part, such that the predictor uses the selected local region for attribute prediction.
Since the localization operation is non-differentiable, we employ a reinforcement learning algorithm bd20 to learn the part localizers and multi-label predictors. In the reinforcement learning formulation, the policy function is the part localizer $c_k$; the state is the cropped local image patch together with the whole image; and the reward function measures the quality of the part attribute and object predictions.
The objective function of the reinforcement learning algorithm for part $k$ is

$$J_k = \sum_{i=1}^{N} \Big( R_i^k - \mathcal{L}_{a}\big(h_k(I_i, c_k(I_i)),\, a_i^k\big) \Big), \qquad (4)$$

where the reward

$$R_i^k = \mathbb{E}_{l \sim \pi_k(\cdot \mid I_i)}\big[r(l, I_i)\big]$$

is the expected reward of the selected region for the $k$-th part of the $i$-th image. $l$ indicates a selected region, $\pi_k(l \mid I_i)$ is the probability that the localizer selects region $l$, and $r(l, I_i)$ is a reward function evaluating the contribution of the selected region to attribute prediction.
Previous methods bd3 ; nips1 set the reward to 1 only when the image is correctly classified. However, since our algorithm predicts multiple attribute values for a part, it is too strict to require that all the attributes be correctly predicted. Therefore, we consider an alternative reward strategy. A selected region receives reward 1 if both of the following criteria are satisfied: 1) it achieves lower attribute classification loss than most other regions in the same image, i.e., its prediction loss ranks among the top-$n$ lowest of the sampled regions of the image; 2) it achieves lower attribute classification loss than most other regions in the same mini-batch, i.e., its prediction loss is lower than half of the average loss of all the regions in the mini-batch.
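This reward strategy can be sketched in pure Python as follows. The function name, its arguments, and the literal reading of criterion 2 as "loss below half of the mini-batch mean" are our assumptions for illustration, not the paper's code:

```python
def region_rewards(sample_losses, batch_mean_loss, top_n):
    """Assign reward 1 to a sampled region iff (1) its attribute-prediction
    loss ranks among the top_n lowest of the regions sampled from the same
    image, and (2) it is below half of the mini-batch's average loss.
    `sample_losses` holds the losses of all regions sampled from one image."""
    ranked = sorted(sample_losses)
    threshold = ranked[min(top_n, len(ranked)) - 1]  # top_n-th lowest loss
    return [1.0 if (l <= threshold and l < 0.5 * batch_mean_loss) else 0.0
            for l in sample_losses]
```

For three sampled regions with losses `[0.1, 0.2, 0.9]`, `top_n=2`, and a batch mean of `1.0`, the first two regions are rewarded and the third is not.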
Following nips1 , we learn fully convolutional attention localization networks as part localizers. Since both parts of the objective function Eq. (4) are differentiable, the REINFORCE algorithm bd20 is applied to compute the policy gradient for optimizing the objective function:

$$\nabla J_k \approx \frac{1}{M} \sum_{m=1}^{M} r(l_m, I_i)\, \nabla \log \pi_k(l_m \mid I_i),$$

where $l_m$ is a local image region sampled according to localizer policy $\pi_k$, and $M$ local regions of the same image are sampled in a mini-batch. We list the learning algorithm in Algorithm 1.
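For a softmax policy over a discrete set of candidate regions, the REINFORCE estimate can be sketched as below. This is an illustrative simplification of our choosing: the actual localizer outputs a spatial probability map, not this toy softmax over region indices:

```python
import math

def reinforce_grad(logits, rewards_by_region, samples):
    """REINFORCE estimate of the policy gradient for a softmax policy
    over candidate regions:
        grad_j ~= (1/M) * sum_m r(l_m) * d/d(logit_j) log pi(l_m),
    using d log pi(l) / d logit_j = 1[j == l] - pi(j)."""
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    probs = [v / total for v in exps]
    grad = [0.0] * len(logits)
    for l in samples:                      # l: index of a sampled region
        r = rewards_by_region[l]
        for j in range(len(logits)):
            grad[j] += r * ((1.0 if j == l else 0.0) - probs[j])
    return [g / len(samples) for g in grad]
```

Regions with reward 0 contribute nothing to the estimate, so gradient ascent pushes probability mass toward regions that earned reward 1.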
2.3 Training of Classifiers
After the local region localizers are trained, we re-train the attribute prediction models using up-scaled local regions: each attribute predictor takes the up-scaled local region produced by its part localizer to predict the attributes.
To combine global and local information, we extract features from all the part regions and the entire image and concatenate them to form a joint representation, which is used to predict the image category. In detail, we first train a classifier on each individual part region to capture the appearance details of local parts for fine-grained recognition. A classifier utilizing the entire image is also trained to capture global information. We then concatenate the features extracted from all the parts and the entire image as a joint representation, and use a linear layer to combine them.
The prediction process is illustrated in the lower part of Fig. 2. We localize and crop each part using its localizer , and each part region is resized to high resolution for feature extraction. Features from all part regions as well as the entire image are concatenated as a joint representation. A linear classification layer is utilized to make the final category prediction.
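As a minimal sketch of this final prediction step (pure Python; the function name, argument layout, and toy feature sizes are ours for illustration):

```python
def joint_prediction(part_features, image_feature, weights, bias):
    """Concatenate per-part features with the full-image feature and
    apply one linear classification layer: logits = W @ feat + b."""
    feat = []
    for f in part_features:   # features from each localized part region
        feat.extend(f)
    feat.extend(image_feature)  # feature from the entire image
    return [sum(w * x for w, x in zip(row, feat)) + b
            for row, b in zip(weights, bias)]
```

In practice each feature vector comes from a separately trained convolutional network, and the linear layer is trained on top of the concatenation.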
We also attempted to use the attribute prediction results to help fine-grained object recognition, and to model the geometric relationship of the parts using recurrent convolutional operations. However, neither approach achieves notable improvements in our experiments. Detailed experimental results and setup can be found in the experimental section.
3 Experiments
We conduct experiments on the CUB-200-2011 dataset bd6 . The dataset contains 11,788 images of 200 bird categories, with 5,994 images for training and the remaining 5,794 images for testing. In addition to the category label, 15 part locations, 312 binary attributes, and a tight bounding box of the bird are provided for each image. Examples of images and attributes in this dataset are shown in Figure 1(a).
3.1 Implementation Details
We evaluate the proposed scheme in two scenarios: “with BB” where the object bounding box is utilized during training and testing, and “without BB” where the object bounding-box is not utilized.
We choose 4 parts, i.e., “head”, “wing”, “breast”, and “tail”, to train local region localizers. The crop size of each part is half of the original image. Among all 312 attributes, an attribute is considered to describe a part if the part's name appears in the attribute's name. The numbers of attributes describing the four parts are 29, 24, 19, and 40, respectively.
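The name-matching assignment above can be sketched as follows (the attribute-name strings here are hypothetical examples in the CUB naming style, not the dataset's exact identifiers):

```python
def assign_attributes_to_parts(attribute_names, part_names):
    """Assign an attribute to a part when the part's name appears as a
    substring of the attribute's name."""
    return {part: [a for a in attribute_names if part in a]
            for part in part_names}
```

Attributes that mention no chosen part (e.g. bill-related ones) are simply left unassigned and unused for localizer training.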
We utilize ResNet-50 nips2 as the visual representation for part localization and feature extraction. In the training stage, we utilize the entire image (in the “with BB” setting, crop the image region within the bounding box) to learn multi-label attribute predictions for each part. The output of the “res5c” layer of ResNet-50 is employed as the input of the fully convolutional attention localization networks. The attribute predictors use ROI-pooled feature maps girshick2015fast of the fully convolutional attention localizers.
We train the models using Stochastic Gradient Descent (SGD) with a momentum of 0.9, 150 epochs, a weight decay of 0.001, and a mini-batch size of 28 on four K40 GPUs. One epoch means all training samples are passed through once. An additional dropout layer with a ratio of 0.5 is added after “res5c”, and the size of “fc15” is changed from 1000 to 200.
The parameters before “res5c” are initialized by the model nips2 pretrained on the ImageNet dataset bd18 , and the parameters of “fc15” are randomly initialized. The initial learning rate is set to 0.0001 and is reduced twice by a factor of 0.1, after 50 and 100 epochs. The learning rate of the last layer (“fc15”) is 10 times that of the other layers.
Our data augmentation is similar to bd7 , but with more types of augmentation. A training image is first rotated by a random angle. A crop is then applied to the rotated image, with the patch size and aspect ratio chosen randomly within fixed ranges relative to the whole image. AlexNet-style color augmentation nips3 is also applied, followed by a random flip. We finally resize the transformed patch to a fixed-size image as the input of the convolutional neural network.
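The parameter-sampling part of this pipeline can be sketched as below. The numeric ranges are placeholders of our choosing, not the paper's actual values:

```python
import random

def sample_augmentation(rng, angle_range=(-30, 30),
                        scale_range=(0.25, 1.0), ratio_range=(0.5, 2.0)):
    """Sample one set of augmentation parameters: a rotation angle, a
    relative crop area, a crop aspect ratio, and a horizontal-flip flag.
    All ranges are illustrative placeholders."""
    return {
        "angle": rng.uniform(*angle_range),
        "scale": rng.uniform(*scale_range),
        "ratio": rng.uniform(*ratio_range),
        "flip": rng.random() < 0.5,
    }
```

Each training image would then be rotated, cropped, color-jittered, optionally flipped, and resized according to one such sampled parameter set.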
3.2 Part Localization Results
We report our part localization results in Table 1. Percent Correct Parts (PCP) is used as the evaluation metric. A part is determined to be correctly localized if the distance between its predicted location and the ground-truth location is within 1.5 times the standard deviation of the ground-truth annotations.
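The metric can be sketched as follows (pure Python; we simplify by using a single scalar standard deviation per part, and the function name is ours):

```python
def pcp(pred, gt, std, tol=1.5):
    """Percent Correct Parts: a part is correct when the Euclidean distance
    between the predicted and ground-truth locations is within `tol` times
    the standard deviation `std` of the ground-truth annotations."""
    correct = 0
    for (px, py), (gx, gy) in zip(pred, gt):
        if ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5 <= tol * std:
            correct += 1
    return 100.0 * correct / len(pred)
```

The reported per-part numbers are such percentages computed over all test images in which the part is visible.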
We compare with previous state-of-the-art part localization methods nips5 ; nips6 ; nips7 . The strongly supervised method nips5 , which utilizes the precise locations of the parts during training, achieves the highest average PCP (71.4) in the “without BB” scenario. Our scheme, which does not use any part or object locations, achieves the second-highest average PCP (66.2) and performs best for localizing the “breast”. The parts are localized much more precisely (73.4 average PCP) when ground-truth bird bounding boxes are used.
Figure 3 provides visualizations of our part localization results. Ground-truth part locations are shown as hollow circles, predicted part locations are shown as solid circles and the selected part regions are shown as thumbnails.
It should be noted that, different from nips5 , our scheme does not directly minimize the part localization error but learns part localizers for attribute prediction. Thus, a selected region that is far from the manual annotation might better predict some part attributes. For example, our predicted “head” positions are usually at the center of the bird's head so that the cropped local region does not lose information, while the manually annotated “head” positions normally fall on the “forehead”.
| Method | Head | Breast | Wing | Tail | Avg |
| --- | --- | --- | --- | --- | --- |
| Shih et al. nips5 | 67.6 | 77.8 | 81.3 | 59.2 | 71.4 |
| Liu et al. nips6 | 58.5 | 67.0 | 71.6 | 40.2 | 59.3 |
| Liu et al. nips7 | 72.0 | 70.5 | 74.4 | 46.2 | 65.8 |
| Ours (without BB) | 60.6 | 79.5 | 77.5 | 47.1 | 66.2 |
| Ours (with BB) | 69.3 | 81.5 | 80.3 | 62.5 | 73.4 |
3.3 Attribute Prediction Results
The attribute prediction results, measured by the average area under the ROC curve (AUC), are reported in Table 2. We use AUC instead of accuracy because the attribute data are highly imbalanced: most attributes are activated in only a very small number of images, while some attributes appear very frequently. For each part, we compute the AUC of each of its attributes and report their mean as the AUC of the part.
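The per-part metric can be sketched as below, using the rank-statistic form of AUC (the probability that a random positive is scored above a random negative); the function names are ours:

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank statistic: the probability
    that a random positive example is scored above a random negative
    example (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def part_auc(attr_scores, attr_labels):
    """Average the per-attribute AUCs of one part."""
    aucs = [auc(s, y) for s, y in zip(attr_scores, attr_labels)]
    return sum(aucs) / len(aucs)
```

Unlike accuracy, this statistic is unaffected by the ratio of positives to negatives, which is why it suits the highly imbalanced attribute labels.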
Directly utilizing the full image as input achieves an overall average AUC of 74.3 over the four parts “head”, “breast”, “wing”, and “tail”. Using only the localized local regions results in a slight drop in overall average AUC. Prediction using both the full image and the local attention regions improves the overall average AUC to 76.6. Bounding boxes of birds are not used for attribute prediction.
| Method | Head | Breast | Wing | Tail | Avg |
| --- | --- | --- | --- | --- | --- |
| Full Image + Attention | 80.3 | 80.7 | 75.5 | 72.1 | 76.6 |
3.4 Fine-grained Recognition Results
In the “without BB” scenario, adding features of two parts (“head” and “breast”) to the baseline ResNet-50 full-image model improves the recognition accuracy to 84.7%, and combining features of four parts improves it to 85.1%.
In the “with BB” scenario, combining features of two parts improves the result to 84.9%, and combining all the part features improves the accuracy to 85.3%.
We carry out further experiments to explore using the attributes for better recognition.
In the “full image + attribute value” experiment, we concatenate 112 binary attribute labels with the original visual feature of the whole image to predict the bird category. In the “full image + attribute feature” experiment, we concatenate the visual features of the 5 part attribute prediction models: one is the original full image model, and the other four models are fine-tuned for predicting the attributes of the four parts.
As Table 3 shows, directly combining attribute values does not improve recognition accuracy over the baseline. Combining attribute features leads to marginal improvements (82.5% and 82.9% accuracy), because the attribute predictions are usually noisy due to the inherent difficulty of predicting some local attributes. By combining features from the full image model, the four part-based models, and the attribute prediction models, we achieve 85.4% and 85.5% accuracy for the “without BB” and “with BB” scenarios, respectively.
We also explored jointly training the part localizers using a recurrent convolutional neural network model as a geometric regularization of part locations nips8 , but the accuracy improvement is negligible in the “without BB” scenario. After examining the data, we find that birds' poses are too diverse to learn an effective geometric model from a limited amount of data.
| Method | Acc without BB (%) | Acc with BB (%) |
| --- | --- | --- |
| Lin et al. bd16 | 84.1 | 85.1 |
| Krause et al. bd22 | 82.0 | 82.8 |
| Zhang et al. bd11 | 73.9 | 76.4 |
| Liu et al. nips1 | 82.0 | 84.3 |
| Jaderberg et al. jaderberg2015spatial | 84.1 | - |
| Full Image + 2 parts | 84.7 | 84.9 |
| Full Image + 4 parts | 85.1 | 85.3 |
| Full Image + attribute value | 81.7 | 82.2 |
| Full Image + attribute feature | 82.5 | 82.9 |
| Full Image + 4 parts + attribute feature | 85.4 | 85.5 |
We compare with previous state-of-the-art methods on this dataset and summarize the recognition results in Table 3. The proposed scheme outperforms state-of-the-art methods that use the same amount of supervision.
Lin et al. bd16 construct high-dimensional bilinear feature vectors and achieve 84.1% and 85.1% accuracy for the “without BB” and “with BB” scenarios, respectively. Krause et al. bd22 learn and combine multiple latent parts in a weakly supervised manner and achieve 82.0% and 82.8% accuracy, respectively. Zhang et al. bd11 train a part-based R-CNN model to detect the head and body of the bird; the method relies on part location annotations during training. Our scheme outperforms bd11 by a large margin without requiring strongly supervised part location annotations. Liu et al. nips1 utilize fully convolutional attention localization networks to select two local parts for model combination, achieving 82.0% accuracy without bounding boxes and 84.3% with bounding boxes, while our accuracy is 84.7% without bounding boxes and 84.9% with bounding boxes when combining features from two local parts (“head” and “breast”). Localizing distinctive parts leads to better recognition accuracy. Similarly, Jaderberg et al. jaderberg2015spatial combine features of four local parts and achieve an accuracy of 84.1% without using bounding boxes. Our scheme using the same number of parts outperforms it (85.1% vs. 84.1%).
3.5 Reward Strategy Visualization
The rewards during the reinforcement learning algorithm are visualized in Figure 4. From left to right, we show the heatmaps of the multi-label attribute prediction loss, the rewards, and the output probability of the localizer at the 1st, 40th, 80th, 120th, 160th, and 200th iterations. As can be seen, after an initial divergence in the localizer probability map, the output of the localizer converges to the “head” position during training, as expected.
4 Conclusion
In this paper, we present an attention localization scheme for fine-grained recognition that learns part localizers from part attribute descriptions. An efficient reinforcement learning scheme is proposed for this task, including a reward function that encourages different part localizers to capture complementary information. The scheme remains computationally efficient when the number of attributes and parts is large. Comprehensive experiments show that our scheme obtains good part localization, improves attribute prediction, and achieves state-of-the-art fine-grained recognition results. In the future, we will continue our efforts to improve models of geometric part-location regularization.
References
- (1) C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” 2011.
- (2) N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns for fine-grained category detection,” Proc. ECCV, 2014.
- (3) E. Gavves, B. Fernando, C. Snoek, A. Smeulders, and T. Tuytelaars, “Fine-grained categorization by alignments,” Proc. ICCV, 2013.
- (4) P. Sermanet, A. Frome, and E. Real, “Attention for fine-grained categorization,” arXiv:1412.7054v3, 2015.
- (5) X. Liu, T. Xia, J. Wang, and Y. Lin, “Fully convolutional attention localization networks: efficient attention localization for fine-grained recognition,” arXiv: 1603.06765, 2016.
- (6) D. Parikh and K. Grauman, “Relative attributes,” Proc. ICCV, 2011.
- (7) Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid, “Label-embedding for attribute-based classification,” Proc. ECCV, 2013.
- (8) S. J. Hwang and L. Sigal, “A unified semantic embedding: Relating taxonomies and attributes,” Proc. NIPS, 2014.
- (9) J. Huang, R. S. Feris, Q. Chen, and S. Yan, “Cross-domain image retrieval with a dual attribute-aware ranking network,” Proc. ICCV, 2015.
- (10) X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2image: Conditional image generation from visual attributes,” arXiv:1512.00570, 2015.
- (11) R. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, 1992.
- (12) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proc. CVPR, 2016.
- (13) R. Girshick, “Fast r-cnn,” arXiv preprint arXiv:1504.08083, 2015.
- (14) J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” Proc. CVPR, 2009.
- (15) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” arXiv:1409.4842, 2014.
- (16) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Proc. NIPS, 2012.
- (17) K. Shih, A. Mallya, S. Singh, and D. Hoiem, “Part-localization using multi-proposal consensus for fine-grained categorization,” Proc. BMVC, 2015.
- (18) J. Liu and P. Belhumeur, “Bird part localization using exemplar-based models with enforced pose and subcategory consistency,” Proc. ICCV, 2013.
- (19) J. Liu, Y. Li, and P. Belhumeur, “Part-pair representation for part localization,” Proc. ECCV, 2014.
- (20) J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a convolutional network and a graphical model for human pose estimation,” Proc. NIPS, 2014.
- (21) T. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” Proc. ICCV, 2015.
- (22) J. Krause, H. Jin, J. Yang, and L. Fei-Fei, “Fine-grained recognition without part annotations,” Proc. CVPR, 2015.
- (23) M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” Proc. NIPS, 2015.