1 Introduction
In the field of computer vision, object detection has long been an important problem. Object detection using deep neural networks now shows excellent performance and is used in many industrial applications such as face recognition [22, 19], pedestrian detection [4, 21], and autonomous vehicles [12, 10]. Object detection methods are mainly divided into single-stage and two-stage methods. The first step of a two-stage method is region proposal. To generate region proposals, Faster R-CNN [17] uses an RPN, which reduces computation and improves performance compared to methods that do not use neural networks. Currently, RPN-based two-stage object detection methods [15, 7, 1, 2, 13] show high performance. Two loss functions are used to train the RPN. The first is a binary cross-entropy loss that classifies candidate regions as foreground or background. The second is the Smooth L1 loss, used to learn the coordinate offsets: it applies an L2 loss to small offsets and an L1 loss to large offsets.

However, there are some things the RPN does not consider. In general, because of occlusion and boundary ambiguities, ground truth bounding boxes have variations introduced during human annotation. In other words, a ground truth bounding box is better expressed as a probability distribution over coordinates [9]. Since the RPN uses the Smooth L1 loss to learn only the bounding box offset, it does not consider the uncertainty of the bounding box coordinates, i.e., the confidence that an object boundary lies at a given coordinate. The Smooth L1 loss is known to be less sensitive to outliers than the plain L2 loss, but it is ultimately not a regressor that takes uncertainty into account. To learn the uncertainty of coordinates, [9] used a KL-Divergence loss and reflected it in the refinement of coordinates during NMS post-processing. In addition, in the existing RPN, the two tasks of classification and bounding box regression are learned independently and do not interact during training. Because predicting a region proposal is a matter of estimating the bounding box coordinates where an object is likely to exist, the two tasks are not independent, and classification should be considered together with bounding box regression for accurate modeling. Regarding coordinate uncertainty, Figure 1 shows several cases relating the objectness score to the uncertainty of the coordinates. In the figure, the blue box has higher coordinate uncertainty than the red box because of occlusion, so the blue box's objectness score should be low. Therefore, the objectness score should not only distinguish foreground from background but also take the uncertainty of the coordinates into account. Motivated by this, and unlike the existing RPN, we propose a new region proposal learning method named KL-RPN that incorporates the uncertainty of the bounding box offset into the objectness score using only a single loss function. Modeling the bounding box offset as a Gaussian distribution, KL-RPN uses the KL-Divergence, the distance between two probability distributions, for region proposal learning. We give each example a target probability distribution that the network should learn. A positive example is assigned a Dirac delta function with zero standard deviation whose mean is the offset between the ground truth bounding box and the anchor. A negative example is assigned a Gaussian distribution with a large standard deviation, approximating a uniform distribution that has the same probability at all values. We then use the predicted standard deviation as the objectness score, applied in the NMS post-processing stage, thereby taking the uncertainty of the bounding box offset into account.

We applied our method to Faster R-CNN and R-FCN, the two most commonly used two-stage object detection methods, with VGG-16 and ResNet-101 backbones, and confirmed the performance improvement on the PASCAL VOC and MS COCO datasets.
Our main contributions are summarized as follows:

(a) In the RPN, the objectness score and the bounding box offset are trained independently. In KL-RPN, both are trained together by minimizing the KL-Divergence.

(b) In KL-RPN, the uncertainty of the bounding box offset is considered as the objectness score.

(c) Our method adds few parameters to the existing RPN, so computation is almost unchanged while performance is improved, without bells and whistles.
2 KL-RPN
2.1 Target Distribution
Figure 2 shows the overall KL-RPN network structure. The network learns the mean $\mu$ and standard deviation $\sigma$ of a single-variate Gaussian as the target distribution of each of the $t_x$, $t_y$, $t_w$, and $t_h$ coordinate offsets. As in the existing RPN [17], the target of the example to be learned at each feature map location is determined based on the IoU (Intersection over Union) with the ground truth bounding boxes. Anchors whose IoU with any ground truth bounding box is 0.7 or more, and the anchor with the highest IoU for each ground truth bounding box, are positive; anchors whose IoU with every ground truth bounding box is less than 0.3 are negative. Once the types of examples are determined, as shown in Figure 2, KL-RPN uses a Dirac delta function with a $\sigma$ of zero as the target for a positive example and a Gaussian distribution with a large standard deviation as the target for a negative example. For the negative example, it would be difficult for the network to learn a uniform distribution, which corresponds to an infinite standard deviation. Instead of a uniform distribution, we set a Gaussian distribution with a standard deviation larger than that of the positive example as the target for a negative sample, to reflect its high uncertainty. The target distribution mean of a positive example is the offset of $t_x$, $t_y$, $t_w$, and $t_h$ as in [6, 17].
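The IoU-based example assignment described above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's code; the function names (`iou`, `label_anchors`) and the (x1, y1, x2, y2) box convention are our own assumptions, while the 0.7/0.3 thresholds and the highest-IoU-per-ground-truth rule come from the text.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return +1 (positive), 0 (negative), -1 (ignored) for each anchor."""
    overlaps = np.stack([iou(gt, anchors) for gt in gt_boxes])  # (num_gt, num_anchors)
    max_per_anchor = overlaps.max(axis=0)
    labels = np.full(len(anchors), -1)
    labels[max_per_anchor < neg_thresh] = 0      # negative: IoU < 0.3 with every gt
    labels[max_per_anchor >= pos_thresh] = 1     # positive: IoU >= 0.7 with some gt
    # the anchor with the highest IoU for each ground truth is also positive
    labels[overlaps.argmax(axis=1)] = 1
    return labels
```

Anchors falling between the two thresholds are ignored during training, as in the standard RPN.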
2.2 KL-Divergence between Two Gaussian Distributions
In this section, we introduce $L_{KL}$, the loss used in KL-RPN training. $L_{KL}$ is a loss function based on the Kullback-Leibler Divergence (KLD), which measures the difference between two probability distributions. The difference between the target probability distribution $P_t(x)$ and the probability distribution $P_p(x)$ predicted by the network can be obtained using the KLD, defined as follows:

(1)  $D_{KL}(P_t \,\|\, P_p) = \int P_t(x) \log \frac{P_t(x)}{P_p(x)} \, dx$
Since we want to minimize the difference between $P_t(x)$ and $P_p(x)$, the loss function based on the KLD is:

(2)  $L_{KL} = D_{KL}(P_t \,\|\, P_p)$
We assume that both $P_t(x)$ and $P_p(x)$ are Gaussian. In this case, the KLD of the two Gaussian distributions has a closed form. Let $P_t(x) = \mathcal{N}(\mu_t, \sigma_t^2)$ and $P_p(x) = \mathcal{N}(\mu_p, \sigma_p^2)$. From equation (2), the loss function of the overall KLD is:

(3)  $L_{KL} = \log\frac{\sigma_p}{\sigma_t} + \frac{\sigma_t^2 + (\mu_t - \mu_p)^2}{2\sigma_p^2} - \frac{1}{2}$
When $P_t(x)$ is the target of a negative sample, we regard it as a Gaussian distribution with a large $\sigma_t$. Following [9], the network predicts $\alpha = \log \sigma_p^2$ to prevent gradient explosion. Considering only the terms relevant to the gradient, the loss can be rearranged as follows:

(4)  $L_{neg} = \frac{\alpha}{2} + \frac{\sigma_t^2 + (\mu_t - \mu_p)^2}{2} e^{-\alpha}$

Here, we set $\sigma_t$ to a fixed large value.
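The closed-form Gaussian KLD of (3) and the rearranged negative-example loss of (4) can be checked numerically. The sketch below is illustrative only; the function names `kld_gaussians` and `neg_loss` are ours. Since (4) drops only terms that do not depend on the prediction, the two expressions should differ by a constant that is independent of the predicted $\alpha$.

```python
import math

def kld_gaussians(mu_t, sigma_t, mu_p, sigma_p):
    """Closed-form KL( N(mu_t, sigma_t^2) || N(mu_p, sigma_p^2) ), eq. (3)."""
    return (math.log(sigma_p / sigma_t)
            + (sigma_t ** 2 + (mu_t - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

def neg_loss(mu_t, sigma_t, mu_p, alpha):
    """Gradient-relevant terms of the negative-example loss, eq. (4),
    with the network predicting alpha = log(sigma_p^2)."""
    return alpha / 2 + (sigma_t ** 2 + (mu_t - mu_p) ** 2) * math.exp(-alpha) / 2
```

Substituting $\sigma_p = e^{\alpha/2}$ into (3) recovers (4) up to the constant $-\log\sigma_t - \tfrac{1}{2}$, which has no gradient with respect to the network outputs.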
2.3 KL-Divergence between a Dirac Delta Function and a Gaussian Distribution
As mentioned above, we regard the distribution of positive examples as a Dirac delta function with a standard deviation of zero as in [9].
(5)  $P_t(x) = \delta(x - x_g)$
Here, $x_g$ is the offset between the ground truth bounding box and the anchor. We now consider the KLD between the Dirac delta function and a Gaussian distribution. Since in equation (2) the probability of $P_t(x)$ exists only at $x_g$, the KLD is:

(6)  $L_{KL} = \frac{(x_g - \mu_p)^2}{2\sigma_p^2} + \frac{1}{2}\log\left(2\pi\sigma_p^2\right) - H(P_t)$
Here, $H(P_t)$ is the entropy of $P_t(x)$, which does not depend on the network prediction. Keeping only the terms relevant to the gradient, the loss can be summarized as follows:

(7)  $L_{pos} = \frac{e^{-\alpha}}{2}(x_g - \mu_p)^2 + \frac{\alpha}{2}$

Here again, $\alpha = \log \sigma_p^2$ as in Section 2.2.
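Equation (7) is the attenuated regression loss of [9]: a large predicted variance down-weights the squared error but pays a logarithmic penalty. A small sketch (the name `pos_loss` is ours) showing that, for a fixed residual $r = x_g - \mu_p$, the loss is minimized at $\alpha = \log r^2$:

```python
import math

def pos_loss(x_g, mu_p, alpha):
    """Positive-example loss, eq. (7): KL between a Dirac delta at x_g and
    N(mu_p, sigma_p^2), keeping gradient-relevant terms; alpha = log(sigma_p^2)."""
    return math.exp(-alpha) * (x_g - mu_p) ** 2 / 2 + alpha / 2
```

Setting the derivative with respect to $\alpha$ to zero gives $e^{-\alpha} r^2 = 1$, i.e. the optimal predicted variance equals the squared residual, which is exactly the uncertainty-matching behavior the loss is meant to encourage.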
2.4 Entire Loss Function
As described in Section 2.1, the target to be learned for each example is determined based on the IoU between the ground truth bounding boxes and the anchor. The loss function for each case is:

(8)  $L_{KL\text{-}RPN} = \begin{cases} L_{pos} & \text{for positive examples} \\ L_{neg} & \text{for negative examples} \end{cases}$
After the RoIs from KL-RPN are detected, we pool them and finally perform classification and regression on each RoI as in [17, 2]. In this stage, the loss functions are the classification loss $L_{cls}$ and the regression loss $L_{reg}$, respectively. The total loss for training the entire object detection network is:

(9)  $L = \lambda L_{KL\text{-}RPN} + L_{cls} + L_{reg}$

Here, we set $\lambda = 10$.
2.5 Objectness Score by using Standard Deviation
We use the standard deviations predicted by the network as the objectness score, so that the uncertainty of the bounding box offset is reflected in it. As in equation (10), the objectness score is defined as the reciprocal of the product of the standard deviations of the single-variate Gaussians whose variables are $t_x$, $t_y$, $t_w$, and $t_h$:

(10)  $Score = \frac{1}{\prod_{i \in \{x, y, w, h\}} (\sigma_i + \epsilon)}$
Here, $\sigma_i$ is the standard deviation predicted by the network, and a small constant $\epsilon$ is added to prevent division by zero. Following Sections 2.2 and 2.3, the $\sigma$ for the $t_x$, $t_y$, $t_w$, and $t_h$ offsets will be close to zero for a positive example and close to the large target value for a negative example. Therefore, the lower the $\sigma$, the more confidently a sample can be considered positive. Once the RoIs predicted by KL-RPN are obtained, NMS is performed based on the objectness score in descending order, and the classification and regression of the second stage of [17, 2] follow. In this part, training is performed in the same manner as in [17, 2].
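Equation (10) can be sketched as below. This is a minimal illustration; the function name and the exact value of $\epsilon$ are assumptions, since the source only states that a small constant prevents division by zero.

```python
def objectness_score(sigmas, eps=1e-6):
    """Objectness score, eq. (10): reciprocal of the product of the predicted
    standard deviations for (tx, ty, tw, th); eps guards against division by zero."""
    prod = 1.0
    for s in sigmas:
        prod *= (s + eps)
    return 1.0 / prod
```

Proposals would then be sorted by this score in descending order before NMS, so that boxes with low offset uncertainty are kept first.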
3 Experiments
We applied KL-RPN to Faster R-CNN and R-FCN, both of which are two-stage object detection methods that use an RPN. [20] was used as the baseline of the experiments, with VGG-16 [18] and ResNet-101 [8] backbones pretrained on ImageNet [3]. The PASCAL VOC [5] and MS COCO [14] datasets were used for training. In all experiments, we used PyTorch [16].

Table 1: Results on MS COCO.

| Method | train data | bs | lr | Proposal | Backbone | AP | AP50 | AP75 | APS | APM | APL | Param (M) | Speed (ms) |
| Faster R-CNN | trainval35k | 16 | 1e-2 | RPN [20] | VGG-16 | 27.1 | 47.1 | 27.7 | 11.5 | 30.0 | 36.5 | 138.31 | 188 |
| | | 16 | 6e-5 | KL-RPN | VGG-16 | 29.7 | 49.9 | 31.5 | 11.6 | 33.0 | 42.4 | 138.32 | 187 |
| | | 16 | 1e-2 | RPN [20] | ResNet-101 | 35.4 | 56.1 | 37.9 | 14.7 | 39.0 | 51.4 | 48.08 | 253 |
| | | 16 | 6e-5 | KL-RPN | ResNet-101 | 35.6 | 55.4 | 38.3 | 14.3 | 39.2 | 52.4 | 48.09 | 260 |
| R-FCN | trainval | 8 | 1e-3 | RPN [2] | | 29.9 | 51.9 | - | 10.8 | 32.8 | 45.0 | - | - |
| | | 8 | 4e-5 | KL-RPN | | 31.9 | 53.9 | 33.4 | 13.1 | 35.0 | 45.9 | - | - |
3.1 Training Details
Because of divergence in the training process, Adam [11] was used as the optimizer, with its parameters $\beta_1$ and $\beta_2$ set to 0.9 and 0.999. In addition, the batch sizes differ between the reference experiments, so different settings were used; they are shown in Table 1 and Table 2. We trained the network on 4 Pascal X (Maxwell) GPUs. RoIAlign [7] is adopted in all experiments. The additional parameters of KL-RPN were newly initialized. No gradient clipping was used with the VGG-16 backbone. The first convolution layer of ResNet-101 is fixed during training. The rest of the experimental settings were conducted in the same manner as in [17, 2].

3.2 Distribution of Offset According to the Objectness Score
In our method, the RoI's objectness score reflects the uncertainty of the bounding box offset. To show how the offset from the ground truth bounding box changes with the objectness score, we compare the distribution of the offsets for high and low objectness scores. Among RoIs whose IoU with a ground truth bounding box is greater than zero, Figure 3 shows the offset from the ground truth bounding box according to the objectness score for $t_x$, $t_y$, $t_w$, and $t_h$, respectively. In Figure 3 (a), for the RPN, the offset from the ground truth bounding box has a similar distribution for bounding boxes with both high and low objectness scores. This is because the RPN trains classification and regression independently and does not consider the uncertainty of the bounding box offset during training. Unlike the existing RPN, in KL-RPN the distribution of the offsets changes depending on the objectness score, as shown in Figure 3 (b).
Table 2: Results on PASCAL VOC (mAP).

| Method | train data | bs | lr | Proposal | Backbone | mAP |
| Faster R-CNN | 07+12 | 1 | 1e-3 | RPN [20] | VGG-16 | 75.9 |
| | | 1 | 2e-5 | KL-RPN | VGG-16 | 77.1 |
| | | 1 | 1e-3 | RPN [20] | ResNet-101 | 80.2 |
| | | 1 | 2e-5 | KL-RPN | ResNet-101 | 80.7 |
| R-FCN | 07 | 2 | 4e-3 | RPN¹ | | 73.8 |
| | | 2 | 4e-5 | KL-RPN | | 74.6 |

¹ https://github.com/princewang1994/RFCN_CoupleNet.pytorch
3.3 Experiments on MS COCO and PASCAL VOC
Table 1 and Table 2 show the experimental results. In the PASCAL VOC experiments, using KL-RPN increased the mAP, especially with the VGG-16 backbone. In the MS COCO experiments, using KL-RPN with Faster R-CNN yields a 2.6 AP improvement with VGG-16 and a 0.2 AP improvement with ResNet-101. In addition, detection performance improves more on large objects than on medium ones. The performance is also improved without significant change in parameters or inference time. With R-FCN, the results show that AP increases by 2 points compared to the existing RPN, a higher performance improvement than when applied to Faster R-CNN. When our method is applied to the fully convolutional variant of two-stage object detection, the effect is greater.
4 Conclusion
In this paper, we proposed a new region proposal method that formulates the region proposal problem as a single task using the KL-Divergence, by considering the uncertainty of the bounding box offset in the objectness score. The positive sample is modeled as a Dirac delta function and the negative sample as a Gaussian distribution, and the model, which predicts a Gaussian distribution, minimizes the KL-Divergence to these targets. By using the KL-Divergence loss, the network gains the ability to predict the standard deviation of the offset from the ground truth bounding box and use it as the objectness score. Experiments show that applying KL-RPN to existing two-stage object detection improves performance, demonstrating that the existing RPN can be successfully replaced.
References

[1] (2018) Cascade R-CNN: delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162.
[2] (2016) R-FCN: object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pp. 379–387.
[3] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
[4] (2009) Pedestrian detection: a benchmark.
[5] (2010) The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
[6] (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.
[7] (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
[8] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[9] (2019) Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2888–2897.
[10] (2018) SINet: a scale-insensitive convolutional neural network for fast vehicle detection. IEEE Transactions on Intelligent Transportation Systems 20 (3), pp. 1010–1019.
[11] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[12] (2017) Fast vehicle detector for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 222–229.
[13] (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
[14] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
[15] (2019) Grid R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7363–7372.
[16] (2017) Automatic differentiation in PyTorch.
[17] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
[18] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[19] (2017) Detecting faces using region-based fully convolutional networks. arXiv preprint arXiv:1709.05256.
[20] (2017) A faster PyTorch implementation of Faster R-CNN.
[21] (2018) Occluded pedestrian detection through guided attention in CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6995–7003.
[22] (2017) S3FD: single shot scale-invariant face detector. In Proceedings of the IEEE International Conference on Computer Vision, pp. 192–201.