KL-Divergence-Based Region Proposal Network for Object Detection

05/22/2020 · by Geonseok Seo, et al.

The learning of the region proposal in object detection using deep neural networks (DNNs) is divided into two tasks: binary classification and bounding box regression. However, the traditional RPN (Region Proposal Network) treats these two tasks as separate problems and trains them independently. In this paper, we propose a new region proposal learning method that considers the uncertainty of the bounding box offset in the objectness score. Our method redefines the RPN objective as minimizing the KL-divergence, the difference between two probability distributions. We applied KL-RPN, which performs region proposal using KL-divergence, to existing two-stage object detection frameworks and showed that it can improve their performance. Experiments show that it achieves 2.6 and 2.0 AP improvements on MS COCO for Faster R-CNN with VGG-16 and R-FCN with ResNet-101 backbones, respectively.


1 Introduction

In the field of computer vision, object detection has long been an important problem. Object detection using DNNs now shows excellent performance and is used in many industrial fields such as face recognition [22, 19], pedestrian detection [4, 21], and autonomous vehicles [12, 10]. Object detection methods are mainly divided into single-stage and two-stage methods. The first step of a two-stage method is the region proposal. To find region proposals, Faster R-CNN [17] uses RPN, which reduces computation and improves performance compared to methods that do not use neural networks. Currently, RPN-based two-stage object detection methods [15, 7, 1, 2, 13] show high performance.

Figure 1: Some cases of the relation between the objectness score and the uncertainty of coordinates. In both cases, the objectness score of the blue box should be lower than that of the red box, considering the high uncertainty of its coordinates caused by occlusion.

For training RPN, two loss functions are used. The first is a binary cross-entropy loss that classifies candidate regions as foreground or background. The second is the smooth $L_1$ loss used to learn the coordinate offsets; it applies an $L_2$-like loss for small offsets and an $L_1$-like loss for large offsets.
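For reference, the standard smooth $L_1$ loss (defined in Fast R-CNN, and only alluded to in the text above) has the piecewise form:

$$ \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} $$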

Figure 2: Network architecture of KL-RPN-based Faster R-CNN. KL-RPN predicts the mean and standard deviation of the bounding box offset. The target of each example is determined based on its IoU with the ground-truth bounding box.

However, there are some things that RPN does not consider. In general, because of occlusion and boundary ambiguities, ground-truth bounding boxes have variations introduced in the process of human annotation. In other words, a ground-truth bounding box is better expressed as a probability distribution over the corresponding coordinates [9]. Since RPN uses the smooth $L_1$ loss to learn only the bounding box offset, it does not consider the uncertainty of the bounding box coordinates, i.e., the confidence that an object boundary lies at those coordinates. The smooth $L_1$ loss is known to be less sensitive to outliers than the $L_2$ loss, but ultimately it is not a regressor that takes uncertainty into account. In order to learn the uncertainty of coordinates, [9] used a KL-divergence loss and reflected it in the refinement of coordinates during NMS post-processing. In addition, in the existing RPN, the two tasks, classification and bounding box regression, are learned independently and do not influence each other during training. Because predicting region proposals is a matter of estimating the bounding box coordinates where an object is likely to exist, the two tasks are not independent, and classification should be considered together with bounding box regression for accurate modeling. In terms of the uncertainty of coordinates, Figure 1 shows several cases of the relation between the objectness score and the uncertainty of coordinates. In the figure, the blue box has higher coordinate uncertainty than the red box because of occlusion, so the blue box's objectness score should be low. Therefore, the objectness score should not only distinguish foreground from background but also take the uncertainty of the coordinates into account. Based on this observation, unlike the existing RPN, we propose a new region proposal learning method named KL-RPN that considers the bounding box offset's uncertainty in the objectness score by using only a single loss function.

Modeling the bounding box offset as a Gaussian distribution, KL-RPN uses the KL-divergence, the distance between two probability distributions, for region proposal learning. We give each example a target probability distribution that the network should learn. A positive example is assigned a Dirac delta function with zero standard deviation, whose mean is the offset between the ground-truth bounding box and the anchor. A negative example is assigned a Gaussian distribution with a large standard deviation, approximating a uniform distribution that has the same probability at all values. We then use the predicted standard deviation as the objectness score, which is applied in the NMS post-processing stage, thereby incorporating the bounding box offset's uncertainty.
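In symbols, with $x_g$ denoting the positive example's offset target and $\sigma_{neg}$ the large standard deviation used for negatives (notation anticipating Section 2), the two target distributions are:

$$ P_{pos}(x) = \delta(x - x_g), \qquad P_{neg}(x) = \mathcal{N}(\mu_t,\, \sigma_{neg}^2) $$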

We applied our method to Faster R-CNN and R-FCN, the two most commonly used two-stage object detection methods, with VGG-16 and ResNet-101 backbones, and confirmed performance improvements on the PASCAL VOC and MS COCO datasets.

Our main contributions are summarized as follows:

  1. In RPN, the objectness score and bounding box offset are trained independently. In KL-RPN, both are trained together by minimizing the KL-divergence.

  2. In KL-RPN, the bounding box offset's uncertainty is considered in the objectness score.

  3. Our method adds few parameters to the existing RPN, so the computation is almost unchanged while performance is improved, without bells and whistles.

2 KL-RPN

2.1 Target Distribution

Figure 2 shows the overall KL-RPN network structure. The network learns the mean $\mu$ and standard deviation $\sigma$ of a single-variate Gaussian as the target distribution for each of the $t_x$, $t_y$, $t_w$, and $t_h$ coordinate offsets. As in the existing RPN [17], the target of each example at each feature map location is determined based on its IoU (Intersection over Union) with the ground-truth bounding boxes. Anchors whose IoU with any ground-truth bounding box is 0.7 or higher, as well as the anchors with the highest IoU for each ground-truth bounding box, are positive; anchors whose IoU with every ground-truth bounding box is less than 0.3 are negative. Once the type of an example is determined, as shown in Figure 2, KL-RPN uses a Dirac delta function with a standard deviation of zero as the target for a positive example and a Gaussian distribution with a large standard deviation as the target for a negative example. For negative examples, it would be difficult for the network to learn a uniform distribution, which corresponds to an infinite standard deviation. Instead, we set a Gaussian distribution with a standard deviation much larger than that of positive examples as the target for negative samples, reflecting their high uncertainty. The target distribution mean of a positive example is given by the offsets $t_x$, $t_y$, $t_w$, and $t_h$ as in [6, 17].
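These offsets follow the standard parameterization of [6, 17], reproduced here for reference, where $(x, y, w, h)$ are the center coordinates and size of the ground-truth box and $(x_a, y_a, w_a, h_a)$ those of the anchor:

$$ t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a} $$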

2.2 KL-Divergence between Two Gaussian Distributions

In this section, we introduce $L_{KL}$, the loss function used in KL-RPN learning. $L_{KL}$ is based on the Kullback-Leibler divergence (KLD), a function used to calculate the difference between two probability distributions. The difference between the target probability distribution $P(x)$ and the probability distribution $Q(x)$ predicted by the network can be obtained by using the KLD, which is defined as follows:

$$ D_{KL}(P \,\|\, Q) = \int P(x) \log\frac{P(x)}{Q(x)}\, dx \quad (1) $$

Since we want to minimize the difference between $P(x)$ and $Q(x)$, the loss function based on the KLD is:

$$ L_{KL} = D_{KL}(P(x) \,\|\, Q(x)) \quad (2) $$

We assume that both $P(x)$ and $Q(x)$ are Gaussian. In this case, the KLD of the two Gaussian distributions has a closed form. Let $P(x) = \mathcal{N}(\mu_t, \sigma_t^2)$ and $Q(x) = \mathcal{N}(\mu, \sigma^2)$. From equation (2), the loss function of the overall KLD is:

$$ L_{KL} = \log\frac{\sigma}{\sigma_t} + \frac{\sigma_t^2 + (\mu_t - \mu)^2}{2\sigma^2} - \frac{1}{2} \quad (3) $$

When an example is a negative sample, we regard its target as a Gaussian distribution with $\sigma_t = \sigma_{neg}$. We predict $\alpha = \log(\sigma^2)$ as in [9] to prevent gradient explosion of $\sigma$. Substituting $\alpha$ and keeping only the terms related to the gradient (the terms $-\log\sigma_{neg}$ and $-\frac{1}{2}$ do not depend on the network outputs), the loss can be rearranged as follows:

$$ L_{neg} = \frac{e^{-\alpha}}{2}\left(\sigma_{neg}^2 + (\mu_t - \mu)^2\right) + \frac{\alpha}{2} \quad (4) $$

Here, $\sigma_{neg}$ is set to a fixed constant.

2.3 KL-Divergence between Dirac Delta Function and Gaussian Distribution

As mentioned above, we regard the distribution of a positive example as a Dirac delta function with a standard deviation of zero, as in [9]:

$$ P_D(x) = \delta(x - x_g) \quad (5) $$

Here, $x_g$ is the offset between the ground-truth bounding box and the anchor. We now consider the KLD between the Dirac delta function and the Gaussian distribution. Since in equation (2) the probability of $P_D(x)$ exists only at $x_g$, the KLD becomes:

$$ L_{KL} = -\log Q(x_g) - H(P_D(x)) = \frac{(x_g - \mu)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2) - H(P_D(x)) \quad (6) $$

Here, $H(P_D(x))$ is the entropy of $P_D(x)$, which does not depend on the estimated parameters. Keeping only the terms related to the gradient, with $\alpha = \log(\sigma^2)$ as before, the loss can be summarized as follows:

$$ L_{pos} = \frac{e^{-\alpha}}{2}(x_g - \mu)^2 + \frac{\alpha}{2} \quad (7) $$

2.4 Entire Loss Function

As described in Section 2.1, the target to be learned for each example is determined based on the IoU between the ground-truth bounding box and the anchor. The loss function for each case is:

$$ L_{KL\text{-}RPN} = \begin{cases} L_{pos} & \text{for a positive example} \\ L_{neg} & \text{for a negative example} \end{cases} \quad (8) $$

After the RoIs from KL-RPN are detected, we pool the RoIs and finally perform classification and regression on each RoI as in [17, 2]. In this stage, the loss functions are $L_{cls}$ and $L_{reg}$, respectively. The total loss for training the entire object detection network is:

$$ L = \lambda\, L_{KL\text{-}RPN} + L_{cls} + L_{reg} \quad (9) $$

Here, we set $\lambda = 10$.
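As an illustration only, here is a minimal PyTorch-style sketch of the per-offset loss in equations (4) and (7), assuming the network outputs the mean mu and alpha = log(sigma^2); the sigma_neg value is a placeholder, since the paper's constant is not recoverable from this text:

```python
import torch

SIGMA_NEG = 10.0  # placeholder constant; the paper's value is not given here


def kl_rpn_loss(mu, alpha, target_offset, is_positive, sigma_neg=SIGMA_NEG):
    """Per-offset KL-RPN loss: eq. (7) for positives, eq. (4) for negatives.

    mu, alpha:      network predictions (alpha = log sigma^2), shape [N]
    target_offset:  x_g for positives / mu_t for negatives, shape [N]
    is_positive:    boolean mask, shape [N]
    """
    sq_err = (target_offset - mu) ** 2
    # Positive branch: KLD to a Dirac delta at x_g, eq. (7).
    l_pos = 0.5 * torch.exp(-alpha) * sq_err + 0.5 * alpha
    # Negative branch: KLD to a wide Gaussian with std sigma_neg, eq. (4).
    l_neg = 0.5 * torch.exp(-alpha) * (sigma_neg ** 2 + sq_err) + 0.5 * alpha
    return torch.where(is_positive, l_pos, l_neg).mean()
```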

2.5 Objectness Score by Using the Standard Deviation

We use the network's predicted standard deviation as the objectness score, thereby applying the uncertainty of the bounding box offset to the objectness score. As in equation (10), the objectness score is defined as the reciprocal of the product of the standard deviations of the single-variate Gaussians for the $t_x$, $t_y$, $t_w$, and $t_h$ offsets.

$$ \text{objectness} = \frac{1}{\sigma_{t_x}\,\sigma_{t_y}\,\sigma_{t_w}\,\sigma_{t_h} + \epsilon} \quad (10) $$

Here, each $\sigma$ is a standard deviation predicted by the network, and $\epsilon$ is a small constant added to prevent division by zero. Following Sections 2.2 and 2.3, the $\sigma$ for the $t_x$, $t_y$, $t_w$, and $t_h$ offsets will be close to zero for a positive example and close to $\sigma_{neg}$ for a negative example. Therefore, the lower the predicted $\sigma$, the better the positive sample can be considered. Once the RoIs predicted by KL-RPN are obtained, the classification and regression of the second stage of [17, 2] proceed after NMS is performed on the RoIs in descending order of objectness score. In this part, training is performed in the same manner as in [17, 2].
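A minimal sketch of equation (10), under the same assumption that the network predicts alpha = log(sigma^2) for each of the four offsets; eps stands in for the unspecified small constant:

```python
import torch

def objectness_score(alpha, eps=1e-6):
    """Objectness from predicted uncertainties, eq. (10).

    alpha: predicted log-variances for (t_x, t_y, t_w, t_h), shape [N, 4]
    Returns one score per RoI; lower sigmas -> higher objectness.
    """
    sigma = torch.exp(0.5 * alpha)          # sigma = exp(alpha / 2)
    return 1.0 / (sigma.prod(dim=1) + eps)  # reciprocal of the product
```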

3 Experiments

We applied KL-RPN to Faster R-CNN and R-FCN, both of which are two-stage object detection methods that use RPN. [20] was used as the baseline of the experiments, with VGG-16 [18] and ResNet-101 [8] backbones pretrained on ImageNet [3]. The PASCAL VOC [5] and MS COCO [14] datasets were used for training. In all experiments, we used PyTorch [16].

| Method | train data | bs | lr | Region Proposal | Backbone | AP | AP50 | AP75 | APS | APM | APL | Param (M) | Speed (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN | trainval35k | 16 | 1e-2 | RPN [20] | VGG-16 | 27.1 | 47.1 | 27.7 | 11.5 | 30.0 | 36.5 | 138.31 | 188 |
| | | 16 | 6e-5 | KL-RPN | VGG-16 | 29.7 | 49.9 | 31.5 | 11.6 | 33.0 | 42.4 | 138.32 | 187 |
| | | 16 | 1e-2 | RPN [20] | ResNet-101 | 35.4 | 56.1 | 37.9 | 14.7 | 39.0 | 51.4 | 48.08 | 253 |
| | | 16 | 6e-5 | KL-RPN | ResNet-101 | 35.6 | 55.4 | 38.3 | 14.3 | 39.2 | 52.4 | 48.09 | 260 |
| R-FCN | trainval | 8 | 1e-3 | RPN [2] | ResNet-101 | 29.9 | 51.9 | - | 10.8 | 32.8 | 45.0 | - | - |
| | | 8 | 4e-5 | KL-RPN | ResNet-101 | 31.9 | 53.9 | 33.4 | 13.1 | 35.0 | 45.9 | - | - |

Table 1: Comparison of RPN and KL-RPN on COCO test-dev. In [20], we changed the short image size to 600 and trained from scratch. For R-FCN, we use multi-scale training. At inference, we selected the RoIs with the top 1000 objectness scores. Speed is measured on a GTX 1080Ti. bs: batch size, lr: learning rate.
Figure 3: The distribution of the $t_x$, $t_y$, $t_w$, and $t_h$ offsets according to the objectness score, for (a) RPN and (b) KL-RPN. Red and blue lines denote high and low objectness scores, respectively. For each image, RoIs numbering five times the ground-truth bounding boxes were sampled in order of objectness score, from high to low. The base model is Faster R-CNN with VGG-16. 100 MS COCO minival images were used.

3.1 Training Details

Because of divergence in the training process, Adam [11] was used as the optimizer, with its parameters $\beta_1$ and $\beta_2$ set to 0.9 and 0.999. In addition, the batch sizes differ for each reference experiment, so different settings were used; they are listed in Table 1 and Table 2. We trained the network on 4 Titan X (Maxwell) GPUs. RoI Align [7] is adopted in all experiments. The additional parameters of KL-RPN were randomly initialized. No gradient clipping was used with the VGG-16 backbone. The first convolution layer of ResNet-101 is fixed during training. The rest of the experimental settings follow [17, 2].
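For reference, a hypothetical PyTorch setup matching the optimizer settings above; the head shown is an illustrative stand-in (channel counts assume 9 anchors with 4 offset means and 4 log-variances each), and learning rates per experiment are those listed in Tables 1 and 2:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the KL-RPN head: 9 anchors x 8 outputs
# (4 offset means + 4 log-variances) per feature map location.
kl_rpn_head = nn.Conv2d(512, 9 * 8, kernel_size=1)

# Adam with the betas stated above; lr per Tables 1 and 2
# (e.g., 6e-5 for Faster R-CNN + KL-RPN on COCO).
optimizer = torch.optim.Adam(kl_rpn_head.parameters(),
                             lr=6e-5, betas=(0.9, 0.999))
```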

3.2 Distribution of the Offset According to the Objectness Score

In our method, the RoI's objectness score is defined by the uncertainty of the bounding box offset. To show how the offset from the ground-truth bounding box changes with the objectness score, we compare the offset distributions for high and low objectness scores. Among RoIs whose IoU with a ground-truth bounding box is greater than zero, Figure 3 shows the offset from the ground-truth bounding box according to the objectness score for $t_x$, $t_y$, $t_w$, and $t_h$, respectively. In Figure 3 (a), RPN, the offset from the ground-truth bounding box has a similar distribution for bounding boxes with both high and low objectness scores. This is because RPN trains classification and regression independently and does not consider the uncertainty of the bounding box offset during training. Unlike the existing RPN, in KL-RPN, Figure 3 (b), the distribution of the offset changes depending on the objectness score.

| Method | train data | bs | lr | Region Proposal | Backbone | mAP |
|---|---|---|---|---|---|---|
| Faster R-CNN | 07+12 | 1 | 1e-3 | RPN [20] | VGG-16 | 75.9 |
| | | 1 | 2e-5 | KL-RPN | VGG-16 | 77.1 |
| | | 1 | 1e-3 | RPN [20] | ResNet-101 | 80.2 |
| | | 1 | 2e-5 | KL-RPN | ResNet-101 | 80.7 |
| R-FCN | 07 | 2 | 4e-3 | RPN¹ | ResNet-101 | 73.8 |
| | | 2 | 4e-5 | KL-RPN | ResNet-101 | 74.6 |

Table 2: Comparison of RPN and KL-RPN on VOC 2007 test. The short side of the input image is fixed to 600. At inference, we selected the RoIs with the top 300 objectness scores. 07: VOC 07 trainval. 07+12: VOC 07 trainval + VOC 12 trainval. bs: batch size, lr: learning rate.

¹ https://github.com/princewang1994/RFCNCoupleNet.pytorch

3.3 Experiments on MS COCO and PASCAL VOC

Table 1 and Table 2 show the experimental results. In the PASCAL VOC experiments, KL-RPN increased the mAP, especially with the VGG-16 backbone. In the MS COCO experiments, using KL-RPN with Faster R-CNN gives a 2.6 AP improvement with VGG-16 and a 0.2 AP improvement with ResNet-101. In addition, detection performance improves more on large objects than on medium ones. The performance is improved without significant change in parameters or inference time. With R-FCN, the results show that AP increases by 2.0 compared to the existing RPN, a larger improvement than when applied to Faster R-CNN. The effect is thus greater when our method is applied to the fully convolutional form of two-stage object detection.

4 Conclusion

In this paper, we proposed a new region proposal method that defines the region proposal problem as a single task using the KL-divergence, considering the bounding box offset's uncertainty in the objectness score. The positive sample is modeled as a Dirac delta function and the negative sample as a Gaussian distribution, so that the model, which predicts a Gaussian distribution, minimizes the KL-divergence to these targets. By using the KL-divergence loss, the network gains the ability to predict the standard deviation of the offset from the ground-truth bounding box and to use it as the objectness score. Experiments show that applying KL-RPN to existing two-stage object detection improves performance, demonstrating that the existing RPN can be successfully replaced.

References

  • [1] Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162. Cited by: §1.
  • [2] J. Dai, Y. Li, K. He, and J. Sun (2016) R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387. Cited by: §1, §2.4, §2.5, §3.1, Table 1.
  • [3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §3.
  • [4] P. Dollár, C. Wojek, B. Schiele, and P. Perona (2009) Pedestrian detection: a benchmark. Cited by: §1.
  • [5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §3.
  • [6] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §2.1.
  • [7] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1, §3.1.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.
  • [9] Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang (2019) Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2888–2897. Cited by: §1, §2.2, §2.3.
  • [10] X. Hu, X. Xu, Y. Xiao, H. Chen, S. He, J. Qin, and P. Heng (2018) SINet: a scale-insensitive convolutional neural network for fast vehicle detection. IEEE Transactions on Intelligent Transportation Systems 20 (3), pp. 1010–1019. Cited by: §1.
  • [11] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.1.
  • [12] C. Lin, P. Sherryl Santoso, S. Chen, H. Lin, and S. Lai (2017) Fast vehicle detector for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 222–229. Cited by: §1.
  • [13] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §1.
  • [14] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §3.
  • [15] X. Lu, B. Li, Y. Yue, Q. Li, and J. Yan (2019) Grid r-cnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7363–7372. Cited by: §1.
  • [16] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §3.
  • [17] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §2.1, §2.4, §2.5, §3.1.
  • [18] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.
  • [19] Y. Wang, X. Ji, Z. Zhou, H. Wang, and Z. Li (2017) Detecting faces using region-based fully convolutional networks. arXiv preprint arXiv:1709.05256. Cited by: §1.
  • [20] J. Yang, J. Lu, D. Batra, and D. Parikh (2017) A faster pytorch implementation of faster r-cnn. Cited by: Table 1, Table 2, §3.
  • [21] S. Zhang, J. Yang, and B. Schiele (2018) Occluded pedestrian detection through guided attention in cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6995–7003. Cited by: §1.
  • [22] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li (2017) S3fd: single shot scale-invariant face detector. In Proceedings of the IEEE International Conference on Computer Vision, pp. 192–201. Cited by: §1.