Localizing by Describing: Attribute-Guided Attention Localization for Fine-Grained Recognition

05/20/2016 ∙ by Xiao Liu, et al. ∙ Baidu, Inc.

A key challenge in fine-grained recognition is how to find and represent discriminative local regions. Recent attention models are capable of learning discriminative region localizers from category labels alone with reinforcement learning. However, because they do not utilize any explicit part information, they are not able to accurately find multiple distinctive regions. In this work, we introduce an attribute-guided attention localization scheme where the local region localizers are learned under the guidance of part attribute descriptions. By designing a novel reward strategy, we are able to learn to locate regions that are spatially and semantically distinctive with a reinforcement learning algorithm. The attribute labeling requirement of the scheme is easier to satisfy than the accurate part location annotation required by traditional part-based fine-grained recognition methods. Experimental results on the CUB-200-2011 dataset demonstrate the superiority of the proposed scheme on both fine-grained recognition and attribute recognition.




1 Introduction

Humans rely heavily on subtle local visual cues to distinguish fine-grained object categories. For example, in Figure 1(a), human experts differentiate a summer tanager from a scarlet tanager by the color of the wing and tail. In order to build human-level fine-grained recognition AI systems, it is also essential to locate discriminative object parts and learn local visual representations for these parts.

State-of-the-art fine-grained recognition methods bd11 ; bd13 either rely on manually labeled parts to train part detectors in a fully supervised manner, or employ reinforcement learning or spatial-transformer-based attention models bd3 ; nips1 to locate object parts with object category annotations in a weakly supervised manner. However, both types of methods have major practical limitations. Fully supervised methods need a time-consuming, error-prone manual object part labeling process, while object labels alone are generally too weak a supervision signal to reliably locate multiple discriminative object parts. For example, in Figure 1(b), the method nips1 fails to locate the tail as an attention region.

Humans have a remarkable ability to learn to locate object parts from multiple sources of information. Aside from strongly labeled object part locations and weakly labeled object categories, part descriptions, such as “red wing”, also play an important part in the development of the ability to locate object parts. At an early age, children learn to recognize and locate object parts by reading or listening to part descriptions. Part descriptions do not require time-consuming manual part location labeling, and they are a much stronger signal than object category labels. We call such descriptions part attributes.

Inspired by this capability, we propose a part attribute-guided attention localization scheme for fine-grained recognition. Using part attributes as a weak supervision training signal, reinforcement learning is able to learn part-specific optimal localization strategies given the same image as environment state. Based on this intuition, it is reasonable to expect that distinctive part localizers could be learned as strategies of looking for and describing the appearances of different parts. In the proposed scheme, multiple fully convolutional attention localization networks nips1 are trained. Each network predicts the attribute values of a part. We design a novel reward strategy for learning part localizers and part attribute predictors.

Part attribute-guided attention localization networks can more accurately locate object parts (Figure 1(c)). More importantly, using the part locations and appearance features from part attribute-guided attention localization networks leads to state-of-the-art performance on fine-grained recognition, as demonstrated on the CUB-200-2011 dataset bd6 . Moreover, part attributes can be acquired at large scale via either human labeling or data mining techniques. They have been successfully employed for image recognition parikh2011relative ; akata2013label ; hwang2014unified , image retrieval huang2015cross , and image generation yan2015attribute2image . To the best of our knowledge, no one has utilized part attributes to guide the learning of visual attention.

Figure 1: (a) Two types of birds with part-attribute descriptions. Human experts differentiate a summer tanager and a scarlet tanager by the color of wing and tail. (b) Attention regions found by nips1 , which fails to locate the tail part. (c) Part-attribute-guided attention localization networks can locate object parts more accurately. The red, green, purple and blue bounding boxes localize head, breast, wing and tail, respectively (best viewed in color).

2 Attribute-Guided Attention Localization

The architecture of the proposed scheme is shown in Figure 2. Attribute descriptions are only used for learning part localizers during training. They are not used in the testing stage.

In the training stage, a fully-convolutional attention localization network nips1 is learned for each part. In contrast with nips1 , the task of the fully-convolutional attention localization network is to learn where to look for better part attribute prediction rather than predicting the category of the entire object.

For testing, we extract features from both part regions located by the network and the entire image and concatenate them as a joint representation. The joint representation is utilized to predict the image category.

Figure 2: An overview of the architecture of the proposed scheme. The correspondence of color and part is the same as in Fig. 1. The upper part shows the training stage, and the lower part shows the testing stage. In the training stage, multiple part localizers are trained under the guidance of attribute descriptions. In the testing stage, features extracted from the selected part regions and the entire image are combined into a joint representation for category prediction.

2.1 Problem Formulation

Given $N$ training images $\{x_1, \dots, x_N\}$ and their object labels $\{y_1, \dots, y_N\}$, our goal is to train a model that classifies each image $x_i$ as its ground-truth label $y_i$.

Fine-grained objects have $P$ parts. The localizing-by-describing problem finds a policy to locate the parts and to classify images based on these part locations and their features. The objective can be formulated as:

$$\min_{G, F_1, \dots, F_P} \; \sum_{i=1}^{N} L\big(G(x_i, F_1(x_i), \dots, F_P(x_i)),\, y_i\big) \tag{1}$$

where $F_p$ is a function that finds the location of part $p$ and crops its image region. The crop size of each part is manually defined. $G$ is a deep convolutional neural network classifier that outputs the probability of each category given the whole image and the cropped image regions for all the parts. $L$ is the cross-entropy loss function measuring the quality of the classification.

Precisely predicting part locations using only the image labels is very challenging. In localizing by describing, we localize object parts with the help of visual attribute descriptions. The visual attribute is a set of binary annotations whose value correlates with an aspect of the object appearance. A local attribute is generally related to the appearance of an object part. The attribute description of the $p$-th part is denoted as $Y^p$. Each part description $Y_i^p$ is a binary vector

$$Y_i^p = \big(Y_{i,1}^p, \dots, Y_{i,K_p}^p\big),$$

where each element indicates whether an attribute exists in the $i$-th image, and $K_p$ is the number of attributes for the $p$-th part. We aim to learn better part localizers using part attribute annotation. An auxiliary part attribute classification loss is proposed to facilitate the learning of part localizers:

$$L_p = \sum_{i=1}^{N} L'\big(T_p(F_p(x_i)),\, Y_i^p\big) \tag{2}$$

where $T_p = (T_{p,1}, \dots, T_{p,K_p})$ is a multi-label attribute prediction function of the $p$-th part, and each $T_{p,k}$ indicates the predicted probability of the $k$-th attribute of part $p$. $L'$ is a multi-label cross-entropy loss:

$$L'\big(T_p(F_p(x_i)),\, Y_i^p\big) = -\sum_{k=1}^{K_p} \Big[ Y_{i,k}^p \log T_{p,k}(F_p(x_i)) + \big(1 - Y_{i,k}^p\big) \log\big(1 - T_{p,k}(F_p(x_i))\big) \Big] \tag{3}$$

The assumption is that the part localizers that help predict part attributes are also beneficial to the prediction of the whole object. We use Eq. (3) as an auxiliary loss to learn localizers that optimize Eq. (1).
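As a minimal sketch of the multi-label cross-entropy loss used for part attribute prediction, the snippet below sums one binary cross-entropy term per attribute. The function name, the clamping constant `eps`, and the toy probabilities are assumptions made for illustration; whether the terms are summed or averaged is also an assumption here.

```python
import math


def multilabel_cross_entropy(probs, labels, eps=1e-12):
    """Sum of binary cross-entropy terms over the attributes of one part.

    probs  -- predicted probability of each attribute (values in [0, 1])
    labels -- binary ground-truth annotation of each attribute (0 or 1)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Hypothetical predictions for three attributes of one part.
loss_good = multilabel_cross_entropy([0.9, 0.1, 0.8], [1, 0, 1])
loss_bad = multilabel_cross_entropy([0.1, 0.9, 0.2], [1, 0, 1])
print(loss_good < loss_bad)
```

A region whose crop supports confident, correct attribute predictions yields a lower loss, which is the signal the localizer is later rewarded for.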

2.2 Training of Localizers

Given images and attribute descriptions, we jointly learn a part localizer and a multi-label predictor for each part such that the predictor uses the selected local region for attribute prediction.

Since the localization operation is non-differentiable, we employ a reinforcement learning algorithm bd20 to learn the part localizers and multi-label predictors. In this formulation, the policy function is the part localizer function $F_p$; the state is the cropped local image patch $F_p(x_i)$ and the whole image $x_i$; the reward function measures the quality of the part attribute and object prediction.

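The objective function of the reinforcement learning algorithm for part $p$ is

$$J(F_p, T_p) = L_p(F_p, T_p) - \sum_{i=1}^{N} \mathbb{E}_{F_p}\big[R_i^p\big] \tag{4}$$

where $L_p$ is the part attribute classification loss of part $p$, and the reward

$$\mathbb{E}_{F_p}\big[R_i^p\big] = \sum_{s} \pi(s \mid F_p, x_i)\, r(s) \tag{5}$$

is the expected reward of the selected region for the $p$-th part of the $i$-th image. $s$ indicates a selected region, $\pi(s \mid F_p, x_i)$ is the probability that the localizer selects the region $s$, and $r(s)$ is a reward function that evaluates the contribution of the selected region to attribute prediction.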

Previous methods bd3 ; nips1 choose the reward function $r(s)$ to be 1 only when the image is correctly classified. However, since our algorithm predicts multiple attribute values for a part, it is too strict to require that all the attributes be correctly predicted. Therefore, we consider an alternative reward strategy. A selected region has reward 1 if both of the following criteria are satisfied: 1) it achieves lower attribute classification loss than most other regions in the same image, i.e., its prediction loss ranks among the top-$K$ lowest of the sampled regions of the image; 2) it achieves lower attribute classification loss than most other regions in the same mini-batch, i.e., its prediction loss is lower than half of the average loss of all the regions in the mini-batch. Otherwise, the reward is 0.
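The two reward criteria can be sketched as follows. This is an illustrative reading of the rule rather than the authors' implementation: the function name `assign_rewards`, the parameter `top_k`, and the toy loss values are assumptions.

```python
def assign_rewards(losses, top_k):
    """Return binary rewards for every sampled region in a mini-batch.

    losses[i][m] is the attribute prediction loss of the m-th region
    sampled from image i. A region gets reward 1 only if it is among the
    top_k lowest losses of its image AND its loss is below half of the
    mini-batch average loss.
    """
    all_losses = [loss for image in losses for loss in image]
    threshold = 0.5 * sum(all_losses) / len(all_losses)

    rewards = []
    for image in losses:
        # Indices of the regions sorted by ascending loss.
        order = sorted(range(len(image)), key=lambda m: image[m])
        top = set(order[:top_k])
        rewards.append([1 if (m in top and image[m] < threshold) else 0
                        for m in range(len(image))])
    return rewards


# Two images, four sampled regions each (hypothetical loss values).
losses = [[0.2, 1.0, 3.0, 0.3],
          [2.0, 2.5, 0.4, 3.0]]
rewards = assign_rewards(losses, top_k=2)
print(rewards)  # [[1, 0, 0, 1], [0, 0, 1, 0]]
```

In the second image, the region with loss 2.0 ranks second within its image but fails the mini-batch criterion (the threshold is 0.775), so it receives no reward.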

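Following nips1 , we learn fully convolutional attention localization networks as part localizers. Since both parts of the objective function Eq. (4) are differentiable, the REINFORCE algorithm bd20 is applied to compute the policy gradient to optimize the objective function:

$$\nabla_{F_p} \mathbb{E}_{F_p}\big[R_i^p\big] = \sum_{s} \pi(s \mid F_p, x_i)\, \nabla_{F_p} \log \pi(s \mid F_p, x_i)\, r(s) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_{F_p} \log \pi(s_i^m \mid F_p, x_i)\, r(s_i^m) \tag{6}$$

where $s_i^m$ are the local image regions sampled according to the localizer policy $\pi(\cdot \mid F_p, x_i)$. $M$ local regions of the same image are sampled in a mini-batch. We list the learning algorithm in Algorithm 1.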
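To make the sampled policy gradient concrete, here is a small sketch of the REINFORCE estimator for a categorical policy over candidate locations. It assumes the localizer output is a softmax over a handful of locations and differentiates with respect to the logits; the real localizer is a fully convolutional network, so this only illustrates the estimator. All names and values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, samples, rewards):
    """Monte Carlo estimate of d E[r] / d logits.

    samples -- indices of the sampled locations
    rewards -- reward of each sampled location
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for s, r in zip(samples, rewards):
        for j in range(len(logits)):
            # d log pi(s) / d logit_j = 1[j == s] - pi(j)
            grad[j] += r * ((1.0 if j == s else 0.0) - probs[j])
    return [g / len(samples) for g in grad]


# Three candidate locations, four samples; only location 0 is rewarded.
grad = reinforce_gradient([0.0, 0.0, 0.0], samples=[0, 1, 2, 0],
                          rewards=[1, 0, 0, 1])
print(grad)
```

Ascending this gradient raises the logit of the rewarded location and lowers the others, which is how the probability map concentrates on useful regions over time.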

Algorithm 1: Learning algorithm for localizing by describing
Input: training images $\{x_1, \dots, x_N\}$, attribute descriptions of each part $\{Y^1, \dots, Y^P\}$.
Output: part localization functions $F_p$, multi-label attribute prediction functions $T_p$.
for each part $p$ do
   Initialize $F_p$ and $T_p$.
   repeat
      Randomly sample a mini-batch of images.
      for each image $x_i$ do
         Sample $M$ regions $s_i^1, \dots, s_i^M$ according to the output of the part localizer $F_p$.
      end for
      for each local region $s_i^m$ do
         Calculate the multi-label attribute prediction loss $\ell_i^m$.
      end for
      Calculate the average loss $\bar{\ell}$ of the mini-batch.
      for each image $x_i$ do
         Sort $\ell_i^1, \dots, \ell_i^M$ in ascending order.
         for each local region $s_i^m$ do
            if $\ell_i^m$ is in the top-$K$ positions of the sorted list, and $\ell_i^m < \bar{\ell}/2$ then
               Set reward $r(s_i^m) = 1$.
            else
               Set reward $r(s_i^m) = 0$.
            end if
         end for
      end for
      Calculate the policy gradient of $F_p$ according to (6).
      Calculate the gradient of $T_p$ by standard back-propagation.
      Update the parameters of $F_p$ and $T_p$.
   until convergence
end for

2.3 Training of Classifiers

After the local region localizers are trained, we re-train the attribute prediction models: the attribute predictors take up-scaled local regions from the part localizers as input to predict the attributes.

To combine global and local information, we extract features from all the part regions and the entire image and concatenate them to form a joint representation, which is used to predict the image category. In detail, we first train a classifier on each individual part region to capture the appearance details of local parts for fine-grained recognition. A classifier utilizing the entire image is also trained for global information. We then concatenate the features extracted from all the parts and the entire image as a joint representation, and use a linear layer to combine the features.

2.4 Prediction

The prediction process is illustrated in the lower part of Fig. 2. We localize and crop each part using its localizer $F_p$, and each part region is resized to high resolution for feature extraction. Features from all part regions as well as the entire image are concatenated as a joint representation. A linear classification layer is utilized to make the final category prediction.
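The concatenate-then-classify step can be sketched in a few lines. The feature vectors here are tiny hand-written lists standing in for CNN features, and the function names, weights, and dimensions are hypothetical.

```python
def joint_representation(global_feature, part_features):
    """Concatenate the whole-image feature with every part feature."""
    joint = list(global_feature)
    for feature in part_features:
        joint.extend(feature)
    return joint


def predict_category(joint, weights, bias):
    """Apply one linear layer and return the index of the highest score."""
    scores = [sum(w * x for w, x in zip(row, joint)) + b
              for row, b in zip(weights, bias)]
    return max(range(len(scores)), key=lambda c: scores[c])


# Toy 2-D features for the whole image and two parts (hypothetical values).
joint = joint_representation([1.0, 0.0], [[0.5, 0.5], [0.0, 2.0]])
weights = [[1, 0, 0, 0, 0, 0],   # category 0 reads the global feature
           [0, 0, 0, 0, 0, 1]]   # category 1 reads the last part feature
category = predict_category(joint, weights, bias=[0.0, 0.0])
print(len(joint), category)  # 6 1
```

Because the linear layer sees global and part features side by side, a category can be decided by a local cue (here, the last part feature) even when the global feature favors another class.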

We also attempt to use attribute prediction results to help fine-grained object recognition, and to model the geometric relationship of the parts using a recurrent convolutional operation. However, we find that neither approach achieves notable improvements in our experiments. Detailed experimental results and setup can be found in the experimental section.

3 Experiments

We conduct experiments on the CUB-200-2011 dataset bd6 . The dataset contains 11,788 images of 200 bird categories, where 5,994 images are for training, and the remaining images are for testing. In addition to the category label, 15 part locations, 312 binary attributes and a tight bounding box of the bird are provided for each image. Examples of images and attributes in this dataset are shown in Figure 1(a).

3.1 Implementation Details

We evaluate the proposed scheme in two scenarios: “with BB” where the object bounding box is utilized during training and testing, and “without BB” where the object bounding-box is not utilized.

We choose 4 parts, i.e. “head”, “wing”, “breast”, and “tail”, to train local region localizers. The cropping size of each part is half of the original image. Among all the 312 attributes, if a part name appears in an attribute, then the attribute is considered to describe this part. The numbers of attributes describing the four parts are 29, 24, 19, and 40, respectively.
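The name-matching rule for assigning attributes to parts can be sketched as below. The attribute strings are illustrative examples in the style of the dataset's attribute names, not a verified excerpt, and plain substring matching is an assumption about how "appears in" is interpreted.

```python
PARTS = ["head", "wing", "breast", "tail"]


def group_attributes_by_part(attribute_names, parts=PARTS):
    """Map each part name to the indices of attributes that mention it."""
    groups = {part: [] for part in parts}
    for index, name in enumerate(attribute_names):
        for part in parts:
            if part in name.lower():  # assumed: simple substring match
                groups[part].append(index)
    return groups


# Illustrative attribute names (hypothetical excerpt).
names = [
    "has_wing_color::red",
    "has_tail_shape::forked_tail",
    "has_breast_pattern::solid",
    "has_head_pattern::capped",
    "has_bill_shape::dagger",
]
groups = group_attributes_by_part(names)
print(groups)
```

Attributes that mention none of the four parts, such as the bill shape above, are simply left out of every group.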

We utilize ResNet-50 nips2 as the visual representation for part localization and feature extraction. In the training stage, we utilize the entire image (in the “with BB” setting, crop the image region within the bounding box) to learn multi-label attribute predictions for each part. The output of the “res5c” layer of ResNet-50 is employed as the input of the fully convolutional attention localization networks. The attribute predictors use ROI-pooled feature maps girshick2015fast of the fully convolutional attention localizers.

We train the models using Stochastic Gradient Descent (SGD) with a momentum of 0.9, 150 epochs, a weight decay of 0.001, and a mini-batch size of 28 on four K40 GPUs. One epoch means all training samples are passed through once. An additional dropout layer with a ratio of 0.5 is added after “res5c”, and the size of “fc15” is changed from 1000 to 200.

The parameters before “res5c” are initialized by the model nips2 pretrained on the ImageNet dataset bd18 , and the parameters of “fc15” are randomly initialized. The initial learning rate is set at 0.0001 and reduced twice with a ratio of 0.1 after 50 and 100 epochs. The learning rate of the last layer (“fc15”) is 10 times larger than that of the other layers.
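The step schedule described above can be written as a small function. The function name and the zero-based epoch convention (the drop takes effect at epoch index 50 and 100) are assumptions.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(50, 100), gamma=0.1,
                  is_last_layer=False):
    """Step schedule: multiply by gamma at each milestone already reached.

    The last layer uses a learning rate 10 times larger than other layers.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    lr = base_lr * gamma ** drops
    return lr * 10 if is_last_layer else lr


for epoch in (0, 50, 100):
    print(epoch, learning_rate(epoch), learning_rate(epoch, is_last_layer=True))
```

Under these assumptions the backbone rate goes from 1e-4 to 1e-5 to 1e-6 over the 150 epochs, with the last layer always one order of magnitude higher.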

Our data augmentation is similar to bd7 , but we have more types of data augmentation. A training image is first rotated with a random angle between and . A cropping is then applied on the rotated image. The size of the cropped patch is chosen randomly between and of the whole image, and its aspect ratio is chosen randomly between and . AlexNet-style color augmentation nips3 is also applied followed by random flip. We finally resize the transformed cropped patch to a image as the input of the convolutional neural network.

3.2 Part Localization Results

We report our part localization results in Table 1. Percent Correct Parts (PCP) is used as the evaluation metric. A part is determined to be correctly localized if the difference of its predicted location and the ground-truth location is within 1.5 times the ground-truth annotation’s standard deviation.
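A minimal sketch of the PCP computation follows. It assumes that the "difference" is the Euclidean distance between 2-D points and that a per-annotation standard deviation is available as input; the function name and the toy coordinates are hypothetical.

```python
import math


def percent_correct_parts(predictions, ground_truths, std_devs, factor=1.5):
    """Percentage of parts predicted within factor * std of the annotation."""
    correct = 0
    for (px, py), (gx, gy), sd in zip(predictions, ground_truths, std_devs):
        if math.hypot(px - gx, py - gy) <= factor * sd:
            correct += 1
    return 100.0 * correct / len(predictions)


# Three predicted part locations against their annotations (toy values).
pcp = percent_correct_parts(
    predictions=[(10, 10), (50, 50), (100, 100)],
    ground_truths=[(12, 11), (80, 50), (100, 103)],
    std_devs=[2.0, 5.0, 4.0],
)
print(round(pcp, 1))  # 66.7
```

The second prediction is 30 pixels from its annotation while the tolerance is 7.5, so only two of the three parts count as correct.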

We compare with previous state-of-the-art part localization methods nips5 ; nips6 ; nips7 . The strongly supervised method nips5 , which utilizes the precise location of the parts during training, achieves the highest average PCP (71.4) in the “without BB” scenario. Our scheme, which does not use any part or object locations, achieves the second highest average PCP (66.2) and performs the best for localizing “breast”. The parts are localized much more precisely (73.4 average PCP) when ground-truth bird bounding boxes are used.

Figure 3 provides visualizations of our part localization results. Ground-truth part locations are shown as hollow circles, predicted part locations are shown as solid circles and the selected part regions are shown as thumbnails.

It should be noted that, unlike nips5 , our scheme does not directly minimize the part localization error but learns part localizers for attribute prediction. Thus, a selected region that is far from the manual annotation might better predict some part attributes. For example, our predicted “head” positions are usually in the center of the bird head such that the cropped local region does not lose any information, while the manually annotated “head” positions normally appear on the “forehead”.

Method head breast wing tail ave
Shih et al. nips5 67.6 77.8 81.3 59.2 71.4
Liu et al. nips6 58.5 67.0 71.6 40.2 59.3
Liu et al. nips7 72.0 70.5 74.4 46.2 65.8
Ours (without BB) 60.6 79.5 77.5 47.1 66.2
Ours (with BB) 69.3 81.5 80.3 62.5 73.4
Table 1: Part localization results (measured by PCP) on the CUB-200-2011 dataset.
Figure 3: Visualizations of part localization results. Ground-truth part locations are shown as hollow circles; predicted part locations are shown as solid circles; selected part regions are shown as thumbnails. Different parts are shown in different colors. No ground-truth bird bounding boxes or part locations are used by the proposed method during training or testing.

3.3 Attribute Prediction Results

The attribute prediction results measured by average Area Under the Curve of ROC (AUC) are reported in Table 2. We use AUC instead of accuracy because the data of attributes are highly imbalanced: most attributes are only activated in a very small number of images, but some attributes appear very frequently. For each part, we calculate the AUC of all its attributes, and report the average AUC of them as the AUC of the part.
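The per-part averaging of AUC can be sketched as follows. The AUC here is computed from pairwise comparisons of positive and negative scores (the rank-statistic form), and attributes with only one class present are skipped; both choices, along with the names and toy scores, are assumptions for illustration.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


def part_auc(score_columns, label_columns):
    """Average AUC (in percent) over all attributes of one part."""
    aucs = [roc_auc(s, l) for s, l in zip(score_columns, label_columns)]
    aucs = [a for a in aucs if a is not None]
    return 100.0 * sum(aucs) / len(aucs)


# Two attributes of one part, four images each (hypothetical scores).
scores = [[0.9, 0.2, 0.4, 0.1], [0.3, 0.6, 0.5, 0.2]]
labels = [[1, 0, 0, 0], [1, 1, 0, 0]]
print(part_auc(scores, labels))  # 87.5
```

The first attribute is active in only one of four images yet still gets a meaningful score, which is why AUC suits imbalanced attributes better than accuracy.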

Directly utilizing the full image as input achieves 76.8, 78.7, 73.1, and 70.4 average AUC for parts “head”, “breast”, “wing” and “tail”, respectively. The overall average AUC of the four parts is 74.3. Using the localized local regions results in a slight performance drop (to 73.7) on overall average AUC. The prediction using both the full image and the local attention regions improves the overall average AUC result to 76.6. Bounding boxes of birds are not used for attribute prediction.

Method head breast wing tail total
Full image 76.8 78.7 73.1 70.4 74.3
Attention 77.8 78.3 71.9 68.8 73.7
Full Image + Attention 80.3 80.7 75.5 72.1 76.6
Table 2: Attribute prediction results (measured by average AUC) on the CUB-200-2011 dataset.

3.4 Fine-grained Recognition Results

For the “without BB” scenario, the baseline ResNet-50 using the whole image achieves 81.7% recognition accuracy. Adding features of two parts (“head” and “breast”) improves the result to 84.7%, and combining features of four parts improves the result to 85.1%.

For the “with BB” scenario, the baseline achieves 82.3% accuracy. Combining features of two parts improves the result to 84.9%, and combining all the features improves the accuracy to 85.3%.

We carry out further experiments to explore using the attributes for better recognition.

In the “full image + attribute value” experiment, we concatenate 112 binary attribute labels with the original visual feature of the whole image to predict the bird category. In the “full image + attribute feature” experiment, we concatenate the visual features of the 5 part attribute prediction models: one is the original full image model, and the other four models are fine-tuned for predicting the attributes of the four parts.

As Table 3 shows, directly combining attribute values does not improve recognition accuracy compared with the baseline. Combining attribute features leads to only marginal improvements (to 82.5% and 82.9%), because the attribute predictions are usually noisy due to the inherent difficulty of predicting some local attributes. By combining features from the full image model, four part-based models and the attribute prediction models, we achieve 85.4% and 85.5% for the “without BB” and “with BB” scenarios, respectively.

We also explore jointly training the part localizers using a recurrent convolutional neural network model as geometric regularization of part locations nips8 , but we find the accuracy improvement is negligible ( without BB). After examining the data, we find birds’ poses are too diverse to learn an effective geometric model from limited amount of data.

Method Acc without BB(%) Acc with BB(%)
Lin et al. bd16 84.1 85.1
Krause et al. bd22 82.0 82.8
Zhang et al. bd11 73.9 76.4
Liu et al. nips1 82.0 84.3
Jaderberg et al. jaderberg2015spatial 84.1 -
Full Image 81.7 82.3
Full Image + 2parts 84.7 84.9
Full Image + 4parts 85.1 85.3
Full Image + attribute value 81.7 82.2
Full Image + attribute feature 82.5 82.9
Full Image + 4parts + attribute feature 85.4 85.5
Table 3: Recognition results on the CUB-200-2011 dataset with different settings.

We compare with previous state-of-the-art methods on this dataset and summarize the recognition results in Table 3. The proposed scheme outperforms state-of-the-art methods that use the same amount of supervision.

Lin et al. bd16 construct high dimensional bilinear feature vectors, and achieve 84.1% and 85.1% accuracy for the “without BB” and “with BB” scenarios, respectively. Krause et al. bd22 learn and combine multiple latent parts in a weakly supervised manner, and achieve 82.0% and 82.8% accuracy for the “without BB” and “with BB” scenarios, respectively. Zhang et al. bd11 train a part-based R-CNN model to detect the head and body of the bird. The method relies on part location annotation during training. Our scheme outperforms bd11 by a large margin without requiring strongly supervised part location annotation. Liu et al. nips1 utilize the fully convolutional attention localization networks to select two local parts for model combination. Their accuracy is 82.0% without bounding box and 84.3% with bounding box, while our accuracy is 84.7% without bounding box and 84.9% with bounding box when combining features from two local parts (“head” and “breast”). Localizing distinctive parts leads to better recognition accuracy. Similarly, Jaderberg et al. jaderberg2015spatial combine features of four local parts and achieve an accuracy of 84.1% without using bounding box. Our scheme using the same number of parts outperforms it by 1.0% (85.1% vs. 84.1%).

3.5 Reward Strategy Visualization

The rewards during reinforcement learning are visualized in Figure 4. From left to right, we show the heatmaps at the 1st, 40th, 80th, 120th, 160th, and 200th iterations for the multi-label attribute prediction loss, the rewards and the output probability of the localizer. As can be seen, after an initial divergence of the localizer probability map, the output of the localizer converges to the “head” position during training as expected.

Figure 4: Our algorithm uses 200 iterations to locate “head” of the bird in the left part. The top row in the right part shows the multi-label attribute prediction loss at different positions, and a lighter position has higher loss. The middle row shows the rewards at different positions. The localizer is encouraged to focus on the light position. The bottom row shows the probability map of the localizer. Lighter positions indicate larger probability of localization.

4 Conclusion

In this paper, we present an attention localization scheme for fine-grained recognition that learns part localizers from part attribute descriptions. An efficient reinforcement learning scheme is proposed for this task. The proposed scheme includes a reward function that encourages different part localizers to capture complementary information. It is also highly computationally efficient when the number of attributes and parts is large. Comprehensive experiments show that our scheme obtains good part localization, improves attribute prediction, and achieves state-of-the-art results on fine-grained recognition. In the future, we will continue our efforts to improve the models of geometric part location regularization.