Attention-based Joint Detection of Object and Semantic Part

07/05/2020 ∙ by Keval Morabia, et al. ∙ University of Illinois at Urbana-Champaign

In this paper, we address the problem of joint detection of objects, such as a dog, and their semantic parts, such as face and leg. Our model is built on top of two Faster-RCNN models that share their features to perform a novel Attention-based feature fusion of related Object and Part features, yielding enhanced representations of both. These representations are used for final classification and bounding box regression separately for both models. Our experiments on the PASCAL-Part 2010 dataset show that joint detection can simultaneously improve both object detection and part detection in terms of mean Average Precision (mAP) at IoU=0.5.




1 Introduction

With advances in camera and imaging technology, the amount of online image and video content is ever-increasing. Manually parsing and making sense of these images is infeasible, which motivates the development of automated techniques. One popular and important such task is automated object detection in images. There have been major recent developments in this domain ([4], [5], [3]), but the diversity of the problem continues to pose new challenges and motivates active research.

Humans identify objects not only by looking at the object as a whole, but also by looking at its various parts to gain confidence. This idea has inspired prior work ([2], [7]). For example, a human in an image can be identified with more confidence by also identifying parts such as hands and faces, and an aeroplane can be identified using information about its wings. [7] propose an LSTM-based architecture for joint modelling of semantic-part information with full object detection. Recently, self-attention based transformer architectures [6] have gained popularity; they eliminate the need for recurrent models such as RNNs for modeling contextual information. We aim to borrow ideas from self-attention in our setting, where the semantic parts associated with an object act as different contexts that are weighted by their attention scores.

In this project, we propose a novel model for joint detection of an object as a whole and its semantic parts using an attention-based feature fusion of the object and its related parts. We use the PASCAL-Part 2010 Dataset ([1]), which contains segmentation masks for objects and their various semantic parts. From our limited experiments on Animal object (bird, cat, cow, dog, horse, sheep) and part (face, leg, neck, tail, torso, wings) detection, we find that our joint detection model gives a slight improvement for object detection and a reasonable improvement for part detection in terms of mean Average Precision (mAP) at an Intersection over Union (IoU) threshold of 0.5. We believe thorough hyperparameter tuning can lead to even better results.

2 Model Architecture

As a starting point, we use PyTorch's Faster-RCNN implementation (with Resnet50 backbone and Feature Pyramid Network) for object detection. We then use another Faster-RCNN model for semantic part detection, which is trained only to identify parts and not objects.

We believe that the performance of both the object detection and part detection models can be improved by using information from the other in a joint training architecture. Our model architecture is strongly motivated by [1], in that we replace the Relationship modeling and LSTM-based feature fusion with an Attention-based feature fusion architecture. We divide our model architecture into 3 parts: (i) Region Feature Extraction, (ii) Attention-based Feature Fusion, and (iii) Region Classification & Regression, as shown in Fig:1.


Figure 1: Architecture for Joint Detection of Object and Semantic Part using Attention-based feature fusion
Figure 2: (a) Example of related parts for an object. (b) Example of related objects for a part. Green boxes are selected for fusion (thresh=0.9), whereas red boxes are not selected because the intersection area of object and part is less than fusion_thresh * part_area

2.1 Region Feature Extraction

In this part, we take the image as input and pass it to 2 different Region Proposal Networks (RPN) [5], one for objects and another for parts, which produce object and part proposals respectively. From the feature map produced by the RPNs, we apply Multi-Scale RoI Align with a Feature Pyramid Network to get a fixed-size feature for each object/part proposal.

2.2 Attention-based Feature Fusion

Once we have all the object and part proposals from the previous step, we compute an enhanced representation of each object and part proposal that can be used in the next step for final object and part detection. To get these new features for object and part boxes, we perform an attention-based feature fusion for each object (or part) and its related parts (or objects). We define a hyperparameter called fusion_thresh that decides which object and part proposal boxes are related to each other and should undergo fusion. We consider an object box $B_{\text{obj}}$ and a part box $B_{\text{part}}$ to be related when:

$$\text{area}(B_{\text{obj}} \cap B_{\text{part}}) \geq \text{fusion\_thresh} \times \text{area}(B_{\text{part}})$$

Fig:2 shows an example of related objects (or parts) that will be used for fusion to get fused part representation of related parts (or related objects).
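The relatedness test described above (and in the caption of Figure 2) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the `(x1, y1, x2, y2)` box layout are assumptions, and only the inequality itself comes from the text.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(a, b):
    """Area of the overlap between two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def is_related(obj_box, part_box, fusion_thresh=0.9):
    """True when the overlap covers at least fusion_thresh of the part box."""
    part_area = box_area(part_box)
    if part_area == 0:
        return False
    return intersection_area(obj_box, part_box) >= fusion_thresh * part_area


def related_parts(obj_box, part_boxes, fusion_thresh=0.9):
    """Indices of the part boxes that are selected for fusion with obj_box."""
    return [i for i, p in enumerate(part_boxes)
            if is_related(obj_box, p, fusion_thresh)]


obj = (0, 0, 100, 100)
parts = [(10, 10, 20, 20), (90, 90, 120, 120), (95, 0, 105, 10)]
selected = related_parts(obj, parts)  # only the first part lies inside obj
```

Because the criterion normalizes by the part area rather than by the union, a small part that lies fully inside a large object box is always selected, which a plain IoU test would not guarantee.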

Once we have the related parts (or objects) for an object (or part), we learn an attention layer that gives a score for each object-part pair. These scores are used to compute a weighted average of the features of all related parts (or objects) for the object (or part). This weighted average is the enhanced surrounding part (or object) representation for that object (or part). We weight the related parts (or objects) because not all related object-part pairs are relevant (for example, a person can be standing next to a car), and not all nearby objects (or parts) are equally important. We expect this to give better performance than a naive average of related parts (or objects).
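To make the weighted average concrete, here is a minimal pure-Python sketch. The text does not specify how the learned attention layer computes its scores, so the dot-product scoring used here is an assumption chosen for illustration, as are the function names; only the "score, normalize, weighted average" structure follows the description above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fuse_related(query_feat, related_feats):
    """Attention-weighted average of related features for one proposal.

    query_feat: feature vector of an object (or part) proposal.
    related_feats: feature vectors of its related parts (or objects).
    Scoring by dot product is an assumption, standing in for the learned layer.
    """
    dim = len(query_feat)
    if not related_feats:
        return [0.0] * dim  # no related boxes: fall back to a zero vector
    weights = softmax([dot(query_feat, f) for f in related_feats])
    return [sum(w * f[d] for w, f in zip(weights, related_feats))
            for d in range(dim)]


# With equal scores the result reduces to the naive average.
fused = fuse_related([1.0, 0.0], [[0.0, 2.0], [0.0, 4.0]])
```

When all scores are equal, the softmax weights are uniform and the output is the naive average, so the naive baseline is a special case of this scheme.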

2.3 Region Classification & Regression

In this step, we use the object features concatenated with fused related part features for final classification and bounding box regression for objects. Similarly, we use the part features concatenated with fused related object features for final classification and bounding box regression for parts.

Figure 3: Example images and ground truth object bounding boxes from PASCAL VOC 2010 dataset
Figure 4: Example segmentation annotations from the PASCAL-Part Dataset

3 PASCAL-Part Dataset 2010

Figure 5: Example of merged smaller part classes into a coarse-grained part class for PASCAL-Part Dataset

The PASCAL VOC 2010 dataset is a popular dataset used to benchmark Object Detection models. Each image has an annotation file containing bounding box coordinates and class labels for each object, as shown in Fig:3. There are 20 classes in the dataset, which can be grouped into 4 super categories: Person, Animal, Vehicle, and Indoor. Training and validation contain 10,103 images while testing contains 9,637 images. The twenty object classes in the PASCAL VOC 2010 dataset are:

  1. Person: person

  2. Animal: bird, cat, cow, dog, horse, sheep

  3. Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train

  4. Indoor: bottle, chair, dining table, potted plant, sofa, tvmonitor

Category    # Train Samples    # Validation Samples
person      1714               1829
animals     1872               1886
vehicles    1493               1483
indoor      692                697
Table 1: Overall category-wise distribution in PASCAL-Part Dataset

The PASCAL-Part Dataset ([1]) is a set of additional annotations for PASCAL VOC 2010, as shown in Fig:4. It provides segmentation masks for each body part of the object, and silhouette annotations for categories that do not have a consistent set of parts (e.g. boat). The part annotations were originally given in .mat format, which we preprocessed using the scipy module to convert into .json format. The annotations only contain segmentation masks, so we use the tightest rectangular box containing each segmentation as the bounding box for Object Detection. This dataset can also be used for animal part detection and general part detection. Note that these part annotations are only given for train and val images.
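The mask-to-box step above can be illustrated with a small sketch. It assumes a binary mask stored as a list of rows and returns inclusive pixel indices; both the representation and the function name are hypothetical and not taken from the authors' preprocessing code.

```python
def mask_to_box(mask):
    """Tightest box around the nonzero pixels of a binary mask.

    mask: list of rows, each a list of 0/1 values.
    Returns (x_min, y_min, x_max, y_max) as inclusive pixel indices,
    or None when the mask is empty.
    """
    rows_with_pixels = [r for r, row in enumerate(mask) if any(row)]
    if not rows_with_pixels:
        return None
    cols_with_pixels = [c for row in mask for c, v in enumerate(row) if v]
    return (min(cols_with_pixels), min(rows_with_pixels),
            max(cols_with_pixels), max(rows_with_pixels))


example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
box = mask_to_box(example_mask)
```

Whether the maximum coordinates are treated as inclusive or exclusive is a convention that has to match the detector's box format; the inclusive choice here is only one option.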

3.1 Dataset Statistics

Class          # Samples
person.head    3960
person.lear    937
horse.lear     213
cow.tail       91
car.door_3     1
Table 2: Uneven part distribution in raw training data

Category    # Train Samples    # Validation Samples
horse       208                316
dog         706                708
cat         563                568
bird        484                484
sheep       350                358
cow         233                239
Table 3: Class-wise object distribution for animals in dataset

Model                              Animal Object Detection    Animal Part Detection
Object Detection                   87.2                       -
Part Detection                     -                          51.3
Joint Object and Part Detection    87.5                       52.0
Table 4: mean Average Precision at Intersection over Union threshold 0.5 for Animal object and part detection on PASCAL-Parts validation dataset

Category    # Train Samples    # Validation Samples
FACE        2933               2909
LEG         5217               5324
NECK        1628               1651
TAIL        1148               1163
TORSO       2978               2974
WINGS       189                175
Table 5: Class-wise part distribution for animals in dataset

For our study, we use the original train-validation splits provided with the dataset. The training set has 4,998 images and the validation set has 5,105 images. Table:1 presents the category-wise distribution of object classes in the training and validation sets.

For part detection, the training dataset has very fine-grained labels spanning 166 different classes, many of which do not have enough training samples. Some statistics on the distribution of classes in the training data are shown in Table:2. To address the uneven distribution of part labels and for faster experimentation, we work only on objects in the animals category and merge fine-grained object-part classes into coarser ones. For example, FACE includes beak, hair, head, nose, lear, lebrow, leye, mouth, rear, rebrow, reye. Similarly, WINGS includes lwing, rwing.
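The merging of fine-grained labels can be sketched as a lookup table. Only the FACE and WINGS groupings are listed in the text, so this sketch covers just those two; the remaining coarse classes (LEG, NECK, TAIL, TORSO) are deliberately left out rather than guessed, and the names `COARSE_PARTS` and `coarse_label` are hypothetical.

```python
# Groupings stated in the text; other coarse classes are not listed there.
COARSE_PARTS = {
    "FACE": ["beak", "hair", "head", "nose", "lear", "lebrow", "leye",
             "mouth", "rear", "rebrow", "reye"],
    "WINGS": ["lwing", "rwing"],
}

# Invert the grouping so each fine-grained label maps to its coarse class.
FINE_TO_COARSE = {
    fine: coarse
    for coarse, fines in COARSE_PARTS.items()
    for fine in fines
}


def coarse_label(fine_label):
    """Coarse class for a fine-grained part label, or None if unmapped."""
    return FINE_TO_COARSE.get(fine_label)


labels = [coarse_label(name) for name in ["leye", "rwing", "unknown_part"]]
```

Returning None for unmapped labels makes it easy to drop parts that fall outside the chosen coarse classes during preprocessing.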

Since our overall task is to jointly model object and part detection, we remove samples in which part information is not available for an object. In Table:5, we present the distribution of part samples in the training and validation sets for the animals category that we work with.

4 Evaluation Metric

Object Detection models are evaluated using mean Average Precision (mAP). Computing mAP involves two components: Intersection over Union (IoU) and Average Precision (AP). IoU measures the overlap between the ground truth box and the predicted box for an object. A prediction is considered correct if IoU ≥ threshold. AP measures the average of the precision values as recall ranges from 0 to 1. mAP is the mean of the AP values over all classes in the dataset. mAP is generally reported at an IoU threshold of 0.5, but it can also be computed at IoU thresholds [0.5, 0.55, 0.6, …, 0.95] and averaged over all of these thresholds.
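The metric described above can be sketched as follows. The text does not say which AP variant was used, so this example assumes the all-point interpolation (area under the precision envelope); the function names and the `(x1, y1, x2, y2)` box layout are also assumptions.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def average_precision(is_true_positive, num_gt):
    """AP for one class.

    is_true_positive: booleans for detections sorted by descending
    confidence (True when the detection matches an unmatched ground
    truth box at the chosen IoU threshold).
    num_gt: number of ground truth boxes of this class.
    """
    if num_gt == 0:
        return 0.0
    tp = 0
    recalls, precisions = [], []
    for rank, hit in enumerate(is_true_positive, start=1):
        tp += 1 if hit else 0
        recalls.append(tp / num_gt)
        precisions.append(tp / rank)
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
ap = average_precision([True, False, True], num_gt=2)
```

In this usage, the two boxes overlap on half of each box, so the intersection is 50 and the union is 150; the three ranked detections recover both ground truth boxes with one false positive in between.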

5 Experiments

Category    Object Detection    Joint Detection
horse       88.6                88.6
dog         82.0                88.8
cat         85.7                92.6
bird        84.9                86.7
sheep       87.6                84.8
cow         92.4                83.0
Table 6: Class-wise object detection AP at IoU=0.5 for animals in dataset

Category    Part Detection    Joint Detection
FACE        25.9              83.3
LEG         83.7              53.8
NECK        53.6              33.0
TAIL        33.0              32.1
TORSO       30.6              80.3
WINGS       81.1              29.3
Table 7: Class-wise part detection AP at IoU=0.5 for animals in dataset

For training all models, we initialize from model parameters pretrained on the MS COCO dataset, which is much larger than the PASCAL VOC 2010 dataset and contains about 4 times the number of classes, and replace the last classification and regression layers to make predictions for objects and parts. This transfer learning helps the model converge in fewer training epochs and achieve better performance. While training, we randomly flip images horizontally with probability 0.5 as a data augmentation technique to make the model more robust.
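The horizontal-flip augmentation mentioned above must flip the bounding boxes together with the image. A minimal sketch, assuming images stored as lists of pixel rows and boxes as `(x1, y1, x2, y2)` in continuous coordinates; the function names are hypothetical and this is not the authors' pipeline.

```python
import random


def hflip_boxes(boxes, image_width):
    """Mirror (x1, y1, x2, y2) boxes around the vertical image axis."""
    return [(image_width - x2, y1, image_width - x1, y2)
            for (x1, y1, x2, y2) in boxes]


def maybe_hflip(image_rows, boxes, p=0.5, rng=random):
    """Flip the image and its boxes together with probability p."""
    if rng.random() < p:
        width = len(image_rows[0])
        flipped = [row[::-1] for row in image_rows]
        return flipped, hflip_boxes(boxes, width)
    return image_rows, boxes


flipped_boxes = hflip_boxes([(10, 20, 30, 40)], image_width=100)
always_image, always_boxes = maybe_hflip([[1, 2, 3]], [(0, 0, 1, 1)], p=1.0)
```

Note that x1 and x2 swap roles after mirroring, which is why the new left edge is computed from the old right edge.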

Figure 6: Sample outputs from our joint trained model. For cleanliness, we have labelled objects and parts separately.
Figure 7: Example output where person parts are also detected when training and evaluating only for animal classes

The joint detection model has about 82M parameters, and training takes 12 minutes per epoch. We train the model for 15 epochs using the SGD optimizer with initial learning rate 1e-3 (decayed by 0.1 after every 5 epochs), momentum 0.9, weight decay 1e-6, batch size of 1 image, and a fusion threshold of 0.9. Note that no annotated test data is available for VOC2010, so all our results are evaluated on the PASCAL VOC 2010 val dataset. For faster experiments, we train and report results only on Animal object (bird, cat, cow, dog, horse, sheep) and part (face, leg, neck, tail, torso, wings) classes. As shown in Table:4, the joint detection model gives a slight improvement in mAP@IoU=0.5 for animal object detection, and a reasonable improvement for animal part detection. Table:6 and Table:7 provide the Average Precision for each of the object and part classes. We believe thorough hyperparameter tuning can lead to even better results. In Figure 6, we present some sample outputs from our jointly trained model. In (b), we can observe that although the parts of the dog are detected correctly, there are multiple overlapping detections. We present more details on the part detection performance, and our experiments with non-max suppression to handle this, in the next subsection.
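The step decay in the training setup above (initial rate 1e-3, multiplied by 0.1 after every 5 epochs) can be written as a one-line function. This sketch assumes 0-indexed epochs and is only an illustration of the stated schedule; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, gamma=0.1, step_size=5):
    """Step-decayed learning rate for a 0-indexed epoch."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate used in each of the 15 training epochs.
schedule = [learning_rate(e) for e in range(15)]
```

Under this reading, epochs 0 to 4 use 1e-3, epochs 5 to 9 use 1e-4, and epochs 10 to 14 use 1e-5.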

Note: We also experimented with a naive average of related parts (or objects) instead of weighted average using attention scores for joint detection, but that did not lead to any improvement in performance over 2 separate object and part detection models.

5.1 Discussion on Part Detection performance

As can be seen in Table:4, part detection performance for animal part classes is much lower than object detection performance, both for the single part detection model and for the Joint detection model. The reason is that the animals sub-dataset we used for all experiments also contains many persons. Looking only at a leg, the model may confuse a person leg with an animal leg and detect both. This leads to many false part detections for the animal classes. This can be observed in Fig:7, where a person's leg and torso are detected even though the annotations used for training cover only animal objects. We expect this issue to be absent when the model is trained to detect all object and part classes.

Varying Part detection NMS inference threshold: We observed a large number of overlapping part detections, with high variance in the areas of their detected bounding boxes. This may be a cause of the reduced mAP scores. Hence, we tried varying the NMS threshold for part evaluation of our Joint Detection model. Our findings are presented in Table:8, which shows no meaningful improvement in performance from varying the threshold.

NMS Threshold    mAP@IoU=0.5
0.1              49.1
0.3              51.0
0.4              51.9
0.45             52.1
0.5              52.0
0.55             51.6
Table 8: NMS Threshold Variation for Part evaluation of Joint Detection Model
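For readers unfamiliar with the threshold being varied, here is a minimal pure-Python sketch of greedy non-max suppression. It illustrates the general technique under the assumption of `(x1, y1, x2, y2)` boxes and a single class; it is not the implementation used in the experiments.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep a box unless it overlaps a higher-scoring kept box.

    Returns indices of kept boxes, ordered by descending score.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept_strict = nms(boxes, scores, iou_thresh=0.5)  # second box suppressed
kept_loose = nms(boxes, scores, iou_thresh=0.7)   # all boxes survive
```

A lower threshold suppresses more aggressively, which removes duplicates but can also remove correct detections of nearby parts, so there is a trade-off when tuning it.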

6 Future Work

We expected a larger improvement from the Joint detection model for object and part detection. Hence, we plan to investigate the feature fusion component in more depth. We also plan to create attention score visualizations to check whether the scores behave as expected. For the above results, we performed all experiments on animal object and part classes, but we also want to evaluate on all object and part classes. Finally, we will tune hyperparameters thoroughly for better performance.

7 Links


  • [1] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille (2014) Detect what you can: detecting and representing objects using holistic models and body parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1971–1978.
  • [2] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2009) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1627–1645.
  • [3] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask R-CNN. arXiv preprint arXiv:1703.06870.
  • [4] J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. arXiv.
  • [5] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
  • [6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • [7] Q. Yao and X. Gong (2018) Exploiting LSTM for joint object and semantic part detection. In Asian Conference on Computer Vision, pp. 498–512.