Object detection has long been an important problem in computer vision. DNN-based object detection now shows excellent performance and is used in many industrial fields such as face recognition [22, 19], pedestrian detection [4, 21], and autonomous vehicles [12, 10]. Object detection methods are mainly divided into single-stage and two-stage methods. The first step of a two-stage method is region proposal. To generate region proposals, Faster R-CNN uses an RPN, which reduces computation and improves performance compared to methods that do not use neural networks. Currently, RPN-based two-stage object detection methods [15, 7, 1, 2, 13] show high performance.
However, there are some things that the RPN does not consider. In general, because of occlusion and boundary ambiguities, ground truth bounding boxes have variations introduced during human annotation. In other words, a ground truth bounding box is better expressed as a probability distribution over coordinates. Since the RPN uses the Smooth L1 loss to learn only the bounding box offset, it does not consider the uncertainty of the bounding box coordinates, i.e., the confidence that a boundary exists at a given coordinate. The Smooth L1 loss is known to be less sensitive to outliers than the L2 loss, but ultimately it is not a regressor that takes uncertainty into account. To learn the uncertainty of coordinates, prior work used a KL-Divergence loss and reflected the learned uncertainty in the refinement of coordinates during NMS post-processing. In addition, in the existing RPN, the two tasks, classification and bounding box regression, are learned independently and do not interact during training. Because predicting region proposals is a matter of estimating the bounding box coordinates where an object is likely to exist, the two tasks are not independent, and classification should be considered together with bounding box regression for accurate modeling. In terms of coordinate uncertainty, Figure 1 shows several cases relating the objectness score to the uncertainty of coordinates. In the figure, the blue box has higher coordinate uncertainty than the red box because of occlusion; in this case, the blue box's objectness score should be low. Therefore, the objectness score should not only distinguish whether a region is foreground or background but also take the uncertainty of the coordinates into account. Motivated by this, unlike the existing RPN, we propose a new region proposal learning method named KL-RPN that incorporates the bounding box offset's uncertainty into the objectness score using only a single loss function.
Modeling the bounding box offset as a Gaussian distribution, KL-RPN uses the KL-Divergence, a measure of the distance between two probability distributions, for region proposal learning. We give each example a target probability distribution that the network should learn. For a positive example, the target is a Dirac delta function whose mean is the offset between the ground truth bounding box and the anchor and whose standard deviation is zero. For a negative example, the target is a Gaussian distribution with a large standard deviation, approximating a uniform distribution that has the same probability at all values. We then use the predicted standard deviation as the objectness score, applied in the NMS post-processing stage, so that the bounding box offset's uncertainty is taken into account.
We applied our method to Faster R-CNN and R-FCN, the two most commonly used two-stage object detection methods, with VGG-16 and ResNet-101 backbones, and confirmed performance improvements on the PASCAL VOC and MS COCO datasets.
Our main contributions are summarized as follows:
In the RPN, the objectness score and the bounding box offset are trained independently. In KL-RPN, both are trained together by minimizing the KL-Divergence.
In KL-RPN, the bounding box offset's uncertainty is used as the objectness score.
Our method adds few parameters to the existing RPN, so the computation is almost unchanged while performance is improved, without bells and whistles.
2.1 Target Distribution
Figure 2 shows the overall KL-RPN network structure. The network learns the mean $\mu$ and standard deviation $\sigma$ of a single-variate Gaussian as the target distribution for each of the $t_x$, $t_y$, $t_w$, and $t_h$ coordinate offsets. As in the existing RPN, the target of the example to be learned at each feature map location is determined by its IoU (Intersection over Union) with the ground truth bounding boxes. Anchors whose IoU with any ground truth bounding box is 0.7 or more, and the anchor with the highest IoU with each ground truth bounding box, are positive; anchors whose IoU with every ground truth bounding box is less than 0.3 are negative. Once the types of examples are determined, as shown in Figure 2, KL-RPN uses a Dirac delta function with a $\sigma$ of zero as the target for positive examples and a Gaussian distribution with a large standard deviation as the target for negative examples. For negative examples, it would be difficult for the network to learn a uniform distribution, which has an infinite standard deviation. Instead, we set a Gaussian distribution with a larger standard deviation than that of positive examples as the target for negative samples, reflecting their high uncertainty. The target distribution mean for a positive example is the offset of $t_x$, $t_y$, $t_w$, and $t_h$ as in [6, 17].
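The IoU-based labeling rule above can be sketched as follows (the helper names are ours; real implementations vectorize this over all anchors and handle border cases):

```python
# Sketch of KL-RPN/RPN anchor labeling: IoU >= 0.7 (or best per ground
# truth) -> positive, IoU < 0.3 with every ground truth -> negative,
# everything else ignored during training.
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (positive), 0 (negative), or -1 (ignored) per anchor."""
    ious = np.array([[iou(a, g) for g in gt_boxes] for a in anchors])
    labels = np.full(len(anchors), -1)
    labels[ious.max(axis=1) < neg_thresh] = 0   # background anchors
    labels[ious.max(axis=1) >= pos_thresh] = 1  # high-overlap anchors
    labels[ious.argmax(axis=0)] = 1             # best anchor per ground truth
    return labels
```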
2.2 KL-Divergence between Two Gaussian Distributions
In this section, we introduce the loss function used in KL-RPN learning.
It is based on the Kullback-Leibler Divergence (KLD), a function used to measure the difference between two probability distributions. The difference between the target probability distribution $P_t(x)$ and the probability distribution $P_\theta(x)$ predicted by the network can be obtained using the KLD, which is defined as follows.
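With $P_t(x)$ the target distribution and $P_\theta(x)$ the predicted distribution, the standard definition of the KLD reads (in our notation):

```latex
D_{KL}\!\left(P_t(x)\,\|\,P_\theta(x)\right) = \int P_t(x)\,\log\frac{P_t(x)}{P_\theta(x)}\,dx
```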
Since we want to minimize the difference between the target distribution $P_t(x)$ and the predicted distribution $P_\theta(x)$, the loss function based on the KLD is:
We assume that both $P_t(x)$ and $P_\theta(x)$ are Gaussian distributions. In this case, the KLD between the two Gaussians has a closed form. Let $P_t(x) \sim N(\mu_t, \sigma_t^2)$ and $P_\theta(x) \sim N(\mu_\theta, \sigma_\theta^2)$. From equation (2), the loss function of the overall KLD is:
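For two single-variate Gaussians $P_t(x) \sim N(\mu_t, \sigma_t^2)$ and $P_\theta(x) \sim N(\mu_\theta, \sigma_\theta^2)$, the KLD has the standard closed form (our notation for the equation referenced above):

```latex
D_{KL}\!\left(P_t \,\|\, P_\theta\right)
  = \log\frac{\sigma_\theta}{\sigma_t}
  + \frac{\sigma_t^2 + (\mu_t - \mu_\theta)^2}{2\sigma_\theta^2}
  - \frac{1}{2}
```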
When $P_t(x)$ is a negative sample, we regard it as a Gaussian distribution with a large $\sigma_t$. We predict $\alpha = \log(\sigma_\theta^2)$ instead of $\sigma_\theta$ to prevent gradient explosion. Considering only the terms related to the gradient, we can rearrange the loss equation as follows.
Here, $\sigma_t$ of the negative sample is set to a fixed large value.
2.3 KL-Divergence between Dirac Delta Function and Gaussian Distribution
As mentioned above, we regard the distribution of a positive example as a Dirac delta function with a standard deviation of zero.
Here, $x_g$ is the offset between the ground truth bounding box and the anchor. Now consider the KLD between the Dirac delta function and the Gaussian distribution. Since in equation (2) the probability mass of $P_t(x)$ exists only at $x_g$, the KLD is:
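When $P_t(x) = \delta(x - x_g)$, the integral collapses onto $x_g$ and the KLD reduces to the negative log-likelihood of $x_g$ under the predicted Gaussian plus the parameter-independent entropy of the target (our notation):

```latex
D_{KL}\!\left(P_t \,\|\, P_\theta\right)
  = -\log P_\theta(x_g) + H\!\left(P_t(x)\right)
  = \frac{(x_g - \mu_\theta)^2}{2\sigma_\theta^2}
  + \frac{1}{2}\log\!\left(2\pi\sigma_\theta^2\right)
  + H\!\left(P_t(x)\right)
```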
Here, $H(P_t(x))$ is the entropy of $P_t(x)$, which does not depend on the network parameters. Keeping only the terms related to the gradient, the loss equation can be summarized as follows.
Here, we set this hyperparameter to a fixed value.
2.4 Entire Loss Function
As described in Section 2.1, the target to be learned for each example is determined by the IoU between the ground truth bounding box and the anchor. For each case, the loss function is as follows.
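The two branches can be sketched in scalar form as follows, assuming the network predicts $\alpha = \log(\sigma_\theta^2)$ per offset and the negative target is a zero-mean Gaussian with standard deviation $\sigma_t$ (the paper's exact constants and weights are not reproduced here):

```python
# Illustrative per-offset KL-RPN losses, keeping only gradient-relevant
# terms; in practice these are applied to each of (tx, ty, tw, th) and
# averaged over the sampled anchors.
import math

def positive_loss(mu, alpha, x_g):
    """KLD between a Dirac delta at the ground-truth offset x_g and the
    predicted Gaussian N(mu, exp(alpha)): the Gaussian negative
    log-likelihood of x_g, up to a constant."""
    return 0.5 * math.exp(-alpha) * (x_g - mu) ** 2 + 0.5 * alpha

def negative_loss(mu, alpha, sigma_t):
    """KLD between a wide zero-mean target Gaussian N(0, sigma_t^2) and
    the predicted Gaussian N(mu, exp(alpha)), gradient-relevant terms."""
    return 0.5 * alpha + (sigma_t ** 2 + mu ** 2) / (2.0 * math.exp(alpha))
```

A positive example is pulled toward small $\sigma_\theta$ around $x_g$, while a negative example is pushed toward a large $\sigma_\theta$, which is what later lets the standard deviation act as an objectness score.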
After the RoIs from KL-RPN are obtained, we pool features for each RoI and finally perform classification and regression on each RoI as in [17, 2]. In this stage, the loss functions are $L_{cls}$ and $L_{reg}$, respectively. The total loss for training the entire object detection network is:
Here, we set the balancing weight to 10.
2.5 Objectness Score by using Standard Deviation
We use the network's predicted standard deviation as the objectness score, thereby applying the uncertainty of the bounding box offset to the objectness score. As in equation (10), the objectness score is defined as the reciprocal of the product of the standard deviations of the single-variate Gaussians for the $t_x$, $t_y$, $t_w$, and $t_h$ offsets.
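Based on the description above, the score takes the following form, with $\sigma_i$ the predicted standard deviation for offset $i$ and $\varepsilon$ a small constant (our reconstruction of the notation):

```latex
s_{\text{obj}} = \frac{1}{\prod_{i \in \{x,\,y,\,w,\,h\}} \left(\sigma_i + \varepsilon\right)}
```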
Here, $\sigma$ is the standard deviation predicted by the network, and a small constant is added to prevent division by zero. Following Sections 2.2 and 2.3, the $\sigma$ for the $t_x$, $t_y$, $t_w$, and $t_h$ offsets will be close to zero for a positive example and close to $\sigma_t$ for a negative example. Therefore, the lower the $\sigma$, the more confident the positive sample. Once the RoIs predicted by KL-RPN are obtained, NMS is performed in descending order of the objectness score, and then the classification and regression of the second stage of [17, 2] are performed. In this part, training proceeds in the same manner as in [17, 2].
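Computing the score from the network outputs can be sketched as follows, assuming the network outputs $\alpha = \log(\sigma^2)$ per offset and a small $\varepsilon$ (both the parameterization and the constant are our assumptions):

```python
import math

def objectness_score(alphas, eps=1e-9):
    """Objectness score: reciprocal of the product of the predicted
    standard deviations for the (tx, ty, tw, th) offsets, with eps to
    avoid division by zero. alphas holds log(sigma^2) per offset."""
    prod = 1.0
    for a in alphas:
        prod *= math.exp(0.5 * a) + eps  # sigma = exp(alpha / 2)
    return 1.0 / prod
```

RoIs with smaller predicted standard deviations (more certain offsets) receive higher scores and therefore survive NMS first.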
3 Experiments
We used VGG-16 and ResNet-101 backbones pretrained on ImageNet. The PASCAL VOC and MS COCO datasets were used for training. All experiments were implemented in PyTorch.
3.1 Training Details
Because of divergence in the training process, Adam was used as the optimizer with its parameters $\beta_1$ and $\beta_2$ set to 0.9 and 0.999. In addition, the batch sizes differ for each reference experiment; the corresponding settings are shown in Table 1 and Table 2. We train the network on 4 Pascal X (Maxwell) GPUs. RoI-Align is adopted in all experiments. The additional parameters of KL-RPN were initialized to a fixed small value. No gradient clipping was used with the VGG-16 backbone. The first convolution layers of ResNet-101 are fixed during training. The rest of the experimental settings follow [17, 2].
3.2 Distribution of offset according to the objectness score
In our method, the RoI's objectness score reflects the uncertainty of the bounding box offset. To show how the offset from the ground truth bounding box changes with the objectness score, we compare the distribution of offsets for high and low objectness scores. Among RoIs whose IoU with a ground truth bounding box is greater than zero, Figure 3 shows the offset from the ground truth bounding box according to the objectness score for $t_x$, $t_y$, $t_w$, and $t_h$, respectively. In Figure 3 (a), the RPN, the offset from the ground truth bounding box has a similar distribution for bounding boxes with both high and low objectness scores. This is because the RPN trains classification and regression independently and does not consider the uncertainty of the bounding box offset during training. Unlike the existing RPN, in KL-RPN the distribution of offsets changes depending on the objectness score, as shown in Figure 3 (b).
3.3 Experiments on MS COCO and PASCAL VOC
Table 1 and Table 2 show the experimental results. In the PASCAL VOC experiments, using KL-RPN increased the mAP, especially with the VGG-16 backbone. In the MS COCO experiments, using KL-RPN with Faster R-CNN yields a 2.6 AP improvement with VGG-16 and a 0.2 AP improvement with ResNet-101. In addition, the detection performance improves more on large objects than on medium ones. The performance is improved without significant changes in parameters or inference time. With R-FCN, AP increases by 2% compared to the existing RPN, a higher improvement than when applied to Faster R-CNN. This suggests that the effect is greater when our method is applied to the fully convolutional form of two-stage object detection.
In this paper, we proposed a new region proposal method that formulates the region proposal problem as a single task using the KL-Divergence, incorporating the bounding box offset's uncertainty into the objectness score. The positive sample is modeled as a Dirac delta function and the negative sample as a Gaussian distribution, and the model, which predicts a Gaussian distribution, is trained to minimize the KL-Divergence to these targets. With the KL-Divergence loss, the network can predict the standard deviation of the offset from the ground truth bounding box and use it as the objectness score. Experiments show that applying KL-RPN to existing two-stage object detectors improves performance, demonstrating that the existing RPN can be replaced successfully.
-  (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162. Cited by: §1.
-  (2016) R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387. Cited by: §1, §2.4, §2.5, §3.1, Table 1.
-  (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §3.
-  (2009) Pedestrian detection: a benchmark. Cited by: §1.
-  (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §3.
-  (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §2.1.
-  (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1, §3.1.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.
-  (2019) Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2888–2897. Cited by: §1, §2.2, §2.3.
-  (2019) SINet: a scale-insensitive convolutional neural network for fast vehicle detection. IEEE Transactions on Intelligent Transportation Systems 20 (3), pp. 1010–1019. Cited by: §1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.1.
-  (2017) Fast vehicle detector for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 222–229. Cited by: §1.
-  (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §1.
-  (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §3.
-  (2019) Grid r-cnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7363–7372. Cited by: §1.
-  (2017) Automatic differentiation in pytorch. Cited by: §3.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §2.1, §2.4, §2.5, §3.1.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.
-  (2017) Detecting faces using region-based fully convolutional networks. arXiv preprint arXiv:1709.05256. Cited by: §1.
-  (2017) A faster pytorch implementation of faster r-cnn. Cited by: Table 1, Table 2, §3.
-  (2018) Occluded pedestrian detection through guided attention in cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6995–7003. Cited by: §1.
-  (2017) S3fd: single shot scale-invariant face detector. In Proceedings of the IEEE International Conference on Computer Vision, pp. 192–201. Cited by: §1.