One of the fundamental challenges in object detection is to increase localization accuracy, i.e., the detector’s ability to predict the correct regions of target objects. It is typically measured by bounding box overlap, i.e., the intersection over union (IoU) of the ground-truth and predicted bounding boxes. While previous challenges (e.g., PASCAL VOC everingham2010pascal and KITTI Geiger2012CVPR ) normally require an IoU threshold of 0.5 for a detection to be considered correct, real-world applications usually call for higher accuracy (e.g., IoU 0.7). For example, vehicle and pedestrian detection in autonomous driving needs accurate distance measurement from real-time road traffic captures.
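As a concrete illustration of the overlap criterion, the following minimal sketch computes the IoU of two axis-aligned boxes and applies a threshold. The `(x1, y1, x2, y2)` corner format, the continuous-coordinate area convention, and the function names are assumptions made for illustration; benchmark toolkits may use slightly different conventions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred, gt, threshold=0.5):
    """A detection counts as correct when its IoU reaches the threshold."""
    return iou(pred, gt) >= threshold


ground_truth = (0.0, 0.0, 10.0, 10.0)
prediction = (2.0, 0.0, 12.0, 10.0)  # shifted right by 2 units
overlap = iou(prediction, ground_truth)
```

In this example the prediction overlaps the ground truth with an IoU of 80/120, so it passes the 0.5 threshold but fails the stricter 0.7 threshold.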
Recent literature has focused on modifying region-based detection models at the post-recognition level to boost localization accuracy felzenszwalb2010object ; gidaris2016locnet ; gidaris2015object . However, limited work has addressed the problem from a data perspective. Data is important. Rapid advances in data collection, storage, and processing technology have made machine learning, especially deep learning, much easier by lightening the burden of generalizing well to unseen data from a limited number of training examples Goodfellow-et-al-2016 .
However, the challenge of learning from imbalanced data he2009learning still exists. Within the "Recognition Using Regions" paradigm gu2009recognition , the training set for object detection is divided into two distinct groups, annotated objects and background regions, and the numbers of examples in these groups are hugely imbalanced. Online Hard Example Mining (OHEM) shrivastava2016training was proposed to overcome this data imbalance by integrating the bootstrapping technique Sung:1996:LES:929901 with region-based detectors, and it can be implemented with little effort on most region-based detectors.
In this paper, we propose S-OHEM, the Stratified Online Hard Example Mining algorithm, for training region-based deep convolutional network detectors to enhance localization accuracy, as shown in Fig. 1. The intuition behind our method is that feeding hard examples to the backpropagation process can overcome the dilemma of unbalanced data, resulting in a more efficient and effective training process shrivastava2016training . In object detection, a hard example is defined as a region proposal with a higher training loss. Thus, previous hard example mining methods (e.g., OHEM) sample region proposals according to a distribution that favors high-loss instances. However, the training loss defined in previous work is the multitask loss with equal weights across all loss types (e.g., classification, localization, mask he2017mask , or rigid and non-rigid categories). This approach ignores the influence of different loss types throughout the training process, which we found essential to training efficacy (e.g., the localization loss is more important during the latter part of training). Therefore, maintaining a sampling distribution that reflects this influence during hard example mining should enhance the performance of object detectors.
S-OHEM exploits stratified sampling, a sampling method that divides a population into distinct groups known as strata li2016dss (homogeneous subgroups whose items are similar to each other). During each mini-batch iteration, S-OHEM first assigns candidate examples (in the form of Regions of Interest, RoIs) to different strata by the ratio between classification and localization loss. The RoIs are then subsampled according to a dynamic distribution and fed into the backpropagation process. With an increasing focus on the localization loss, S-OHEM can predict more accurate bounding boxes and therefore enhance localization accuracy. We apply S-OHEM to the standard Fast R-CNN girshick2015fast and Faster R-CNN ren2015faster detection methods and evaluate it on the PASCAL VOC 2007 and KITTI datasets. Our systematic experimental analysis shows that S-OHEM yields AP improvements of 0.5% on rigid categories of PASCAL VOC 2007 for both IoU thresholds of 0.6 and 0.7. For KITTI 2012, both results of the same metric are 1.6%. Regarding the mAP, a relative increase of 0.3% and 0.5% (1% and 0.5%) is observed for VOC07 (KITTI12) with the same set of IoU thresholds.
The remainder of this paper is structured as follows. In Sect. 2, we compare our work with related research with a focus on the improvement of localization accuracy and the use of data in object detection. In Sect. 3, we describe the design of the algorithm. In Sect. 4, we show the experimental results, and in Sect. 5, we conclude this work.
2 Related Work
Object detection has benefited significantly from advances in image classification. The remarkable feature extraction ability of deep convolutional networks krizhevsky2012imagenet ; szegedy2015going ; SimonyanZ14a ; he2016deep ; szegedy2016rethinking has equipped us with abundant information for the classification of region proposals. In addition, continuously developing practical strategies (e.g., activation functions XuWCL15 ; nair2010rectified ; he2015delving , regularization srivastava2014dropout ; Sung:1996:LES:929901 ; IoffeS15 , and optimization duchi2011adaptive ; hinton2012neural ; KingmaB14 ) further contribute to the efficacy of deep neural networks.
Several region-based detectors depend on the strong classification capability of deep convolutional networks to evaluate generated RoIs. R-CNN was the first to adopt this approach, evaluating each RoI separately. Fast R-CNN girshick2015fast improved on this method by sharing computation: RoIs are projected onto a shared feature map (through the RoIPool layer, derived from SPPnets he2014spatial ), resulting in better speed and accuracy. It was then integrated with the region proposal module (the Region Proposal Network, RPN) by sharing their convolutional features and extended to a unified network with an "attention" BahdanauCB14 mechanism, leading to further speedup and accuracy gains. R-FCN li2016r eliminates the fully-connected layers of region-based detectors and makes the whole model fully convolutional, using the backbones of state-of-the-art image classifiers he2016deep ; szegedy2016rethinking to fully share computation, which contributes to a significant speedup. Mask R-CNN he2017mask , which adds a small Fully Convolutional Network (FCN) long2015fully as a parallel branch to standard Faster R-CNN and replaces the RoIPool layer with the RoIAlign layer, is the latest descendant of this line and achieves significant advances on several benchmarks for both detection and segmentation. However, most of these models use the multitask loss with equal weights, without considering the influence of different loss types throughout the training process.
Recent work has focused on the post-recognition level of region-based detection models to boost localization accuracy. Gidaris et al. gidaris2015object proposed a CNN model for bounding box regression, which is used with iterative localization and bounding box voting. LocNet gidaris2016locnet aims to enhance localization accuracy by assigning to each border of a loosely localized search region a probability of being related to the object’s bounding box. It differs from the bounding box regression approaches felzenszwalb2010object adopted by most of the aforementioned region-based detectors and can serve as an effective alternative.
However, little work has focused on the advancement of region-based detectors from a data perspective. Online Hard Example Mining (OHEM) shrivastava2016training integrates bootstrapping Sung:1996:LES:929901 (or hard example mining) with region-based detectors for a small extra computational cost, but still lacks enough focus on the localization accuracy because of the derived multi-task loss imbalance. Further discussion is available in Sect. 3.
3 Model Design
In this section, we argue that the current way of choosing hard examples lacks enough focus on localization accuracy and is suboptimal, and we will show that our approach results in better training (lower training loss), higher localization performance, and higher average precision. Firstly, we discuss the design motivation. Then we give a brief introduction of stratified sampling and definition of stratified constraint in this work. Finally, we present the design and implementation of our Stratified Online Hard Example Mining algorithm (S-OHEM).
3.1 Motivation
Most region-based detectors inherit multitask learning from Fast R-CNN and assume equal contributions of the classification loss and the localization loss throughout the training process. However, this assumption often does not hold. We apply the original OHEM to standard Fast R-CNN and Faster R-CNN, and report the classification and localization losses throughout the training process on the PASCAL VOC and KITTI datasets separately.
As illustrated in Fig. 2, the classification loss is consistently larger than the localization loss (more than double on average), and this can cause a problem. Consider a situation where we have two region proposals, $\mathrm{RoI}_A$ and $\mathrm{RoI}_B$, and are asked to choose one as the hard example for backpropagation. Based on the preliminary experimental result shown in Fig. 2, we make a common assumption about the classification loss $L_{cls}$ and the localization loss $L_{loc}$ of the two RoIs. Recall that the classification loss is defined as the log loss $L_{cls} = -\log p_u$ for the true class $u$ girshick2015fast ; under the assumed losses, the probability of the true class is 61.5% and 64.5% for $\mathrm{RoI}_A$ and $\mathrm{RoI}_B$ respectively (corresponding to $L_{cls} \approx 0.49$ and $0.44$). The gap in class prediction probability between these two RoIs is not significant, and we can believe they perform similarly on the classification task.
Regarding the localization loss, the gap between $\mathrm{RoI}_A$ and $\mathrm{RoI}_B$ is 0.01. Under the smooth $L_1$ loss girshick2015fast , this gap corresponds to a difference of 0.14 between the ground-truth and predicted bounding boxes. Note that this gap is quite significant under the parameterization of bounding box offsets given in girshick2014rich , and therefore we should choose $\mathrm{RoI}_B$ as the hard example for better localization accuracy and prediction quality. However, under the equal-weight multitask loss, $\mathrm{RoI}_A$ will be chosen as the hard one. Thus, the previous hard example mining approach lacks focus on localization accuracy.
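The arithmetic behind this argument can be checked with a short sketch. The class probabilities (61.5% and 64.5%) and the 0.01 localization gap come from the text; the individual localization losses (0.17 and 0.18) are hypothetical values chosen only to reproduce that gap, and reading the 0.14 figure as the offset whose smooth L1 loss equals 0.01 is an assumption.

```python
import math


def log_loss(p_true):
    """Classification loss: negative natural log of the true-class probability."""
    return -math.log(p_true)


def smooth_l1(x):
    """Smooth L1 loss for a single offset residual x."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5


def offset_for_loss(loss):
    """Invert smooth L1 on its quadratic branch (valid for loss < 0.5)."""
    return math.sqrt(2.0 * loss)


# Classification losses implied by the stated true-class probabilities.
cls_a, cls_b = log_loss(0.615), log_loss(0.645)

# Hypothetical localization losses; only their 0.01 gap is taken from the text.
loc_a, loc_b = 0.17, 0.18

# Offset residual whose smooth L1 loss equals the 0.01 gap.
gap_offset = offset_for_loss(0.01)

# Equal-weight multitask loss, as used by the original OHEM ranking.
total_a, total_b = cls_a + loc_a, cls_b + loc_b
picked_by_equal_weight = "A" if total_a > total_b else "B"
picked_by_localization = "A" if loc_a > loc_b else "B"
```

Under these assumptions the equal-weight ranking selects the first RoI because its slightly larger classification loss outweighs the localization gap, while a localization-focused ranking would select the second.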
3.2 Stratified Sampling
Stratified sampling is a sampling method that divides a population into distinct groups known as strata li2016dss . These strata are homogeneous subgroups of the original data whose items are similar to each other. Stratified sampling can achieve higher statistical precision because the variability within subgroups sharing the same properties is lower than that of the entire population thompson2012stratifiedsampling . Therefore, stratified sampling improves representativeness by reducing sampling error.
Each stratum constraint is denoted by $s_k = (\varphi_k, f_k)$, where $\varphi_k$ is a propositional formula and $f_k$ is the required sample size. In this work, the four stratum constraints are defined by the ratio between the classification loss ($L_{cls}$) and the localization loss ($L_{loc}$): high $L_{cls}$ and high $L_{loc}$ ($s_1$), high $L_{cls}$ and low $L_{loc}$ ($s_2$), low $L_{cls}$ and high $L_{loc}$ ($s_3$), and low $L_{cls}$ and low $L_{loc}$ ($s_4$). The required sample size and the threshold of high loss (hard examples) change dynamically throughout the training process.
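The four-way partition can be sketched as follows. This is only an illustration: the stratum indices 1 to 4, the use of one fixed threshold per loss type, and the function names are assumptions, whereas in the method described here the thresholds change during training.

```python
def assign_stratum(l_cls, l_loc, cls_thresh, loc_thresh):
    """Return an illustrative stratum index (1..4) for one RoI.

    1: high classification loss, high localization loss
    2: high classification loss, low localization loss
    3: low classification loss, high localization loss
    4: low classification loss, low localization loss
    """
    high_cls = l_cls >= cls_thresh
    high_loc = l_loc >= loc_thresh
    if high_cls and high_loc:
        return 1
    if high_cls:
        return 2
    if high_loc:
        return 3
    return 4


def stratify(losses, cls_thresh, loc_thresh):
    """Group RoI indices by stratum; losses is a list of (l_cls, l_loc) pairs."""
    strata = {1: [], 2: [], 3: [], 4: []}
    for idx, (l_cls, l_loc) in enumerate(losses):
        strata[assign_stratum(l_cls, l_loc, cls_thresh, loc_thresh)].append(idx)
    return strata


# Hypothetical per-RoI losses, for illustration only.
example_losses = [(0.9, 0.4), (0.8, 0.1), (0.2, 0.5), (0.1, 0.05)]
example_strata = stratify(example_losses, cls_thresh=0.5, loc_thresh=0.25)
```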
3.3 Stratified Online Hard Example Mining algorithm
Given the observation that the previous hard example mining approach ignores the influence of different loss types throughout the training process and lacks focus on localization accuracy, we now demonstrate our approach of Stratified Online Hard Example Mining (S-OHEM).
The architecture of S-OHEM is shown in Fig. 1. In each mini-batch iteration, S-OHEM first generates region proposals for the input images, forward-propagates all of them through the region-based detector, and gathers the training loss of each RoI. Each RoI is then assigned to one of the four strata defined in Sect. 3.2. The different loss type combinations indicate how well the current detector performs on the classification and localization tasks for each RoI. Inside each stratum, hard examples are chosen by sorting the RoIs by loss. After that, all RoIs are subsampled according to a dynamic distribution, and a fixed total number of hard examples is fed into the backpropagation process. The sampling distribution of RoIs from each stratum changes dynamically throughout the learning process, as each loss type contributes differently to the detector model at different training stages. Specifically, the classification loss is more important in the beginning, while the localization loss contributes more at later training stages.
For implementation, we keep a record of the historical training loss and start to change the sampling distribution when the loss becomes stable (e.g., after 40K iterations, as shown in Fig. 2). At the beginning of training, we only sample the top RoIs with high $L_{cls}$ (i.e., sample from $S_{cls}$, the union of the two strata with high $L_{cls}$). When the loss becomes stable, we gradually focus on choosing the RoIs with high $L_{loc}$ (i.e., sample from the union of the two strata with high $L_{loc}$, denoted by $S_{loc}$) by increasing the sampling ratio between $S_{loc}$ and $S_{cls}$. Because of the gradually increasing focus on the localization loss, S-OHEM can predict more accurate bounding boxes and thus enhance localization accuracy.
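A minimal sketch of this ratio-controlled sampling is given below. It is one possible reading of the procedure, not the reference implementation: the `loc_ratio` parameter (the fraction of the budget drawn from the high-localization-loss group), the fixed thresholds, and the fill-up rule are all assumptions introduced for illustration.

```python
def sample_hard_rois(losses, budget, loc_ratio, cls_thresh, loc_thresh):
    """Pick up to `budget` RoI indices from the two high-loss groups.

    losses    : list of (l_cls, l_loc) pairs, one per RoI
    loc_ratio : fraction of the budget taken from RoIs with high l_loc;
                0.0 reproduces classification-only mining
    """
    # RoIs with high classification loss, hardest first.
    s_cls = [i for i, (c, _) in enumerate(losses) if c >= cls_thresh]
    s_cls.sort(key=lambda i: losses[i][0], reverse=True)
    # RoIs with high localization loss, hardest first.
    s_loc = [i for i, (_, l) in enumerate(losses) if l >= loc_thresh]
    s_loc.sort(key=lambda i: losses[i][1], reverse=True)

    # Take the localization share first.
    n_loc = min(len(s_loc), int(budget * loc_ratio))
    chosen = list(s_loc[:n_loc])
    seen = set(chosen)
    # Fill the rest of the budget from the classification group, skipping duplicates.
    for i in s_cls:
        if len(chosen) >= budget:
            break
        if i not in seen:
            chosen.append(i)
            seen.add(i)
    return chosen


# Hypothetical per-RoI losses, for illustration only.
rois = [(0.9, 0.1), (0.8, 0.3), (0.2, 0.4), (0.1, 0.05), (0.7, 0.35)]
early = sample_hard_rois(rois, budget=2, loc_ratio=0.0, cls_thresh=0.5, loc_thresh=0.25)
late = sample_hard_rois(rois, budget=2, loc_ratio=0.5, cls_thresh=0.5, loc_thresh=0.25)
```

With `loc_ratio` at zero only the RoIs with the largest classification loss are kept; raising it moves part of the budget to RoIs that are poorly localized even when their classification loss is low.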
An equivalent alternative is available. For simplicity, we denote the contribution coefficients of $L_{cls}$ and $L_{loc}$ to hard example selection by $\alpha$ and $\beta$ respectively, and our approach aims to find the optimal values of $\alpha$ and $\beta$ in Formula (1) at different training stages:
$$L = \alpha L_{cls} + \beta L_{loc} \qquad (1)$$
$L$ is used only for hard example mining, and the actual loss backpropagated through the network is not affected.
When training begins, we only sample the top RoIs with high $L_{cls}$ by setting $\alpha$ and $\beta$ in Formula (1) to 1 and 0 respectively. When the loss becomes stable, we gradually focus on choosing the RoIs with high $L_{loc}$ by gradually decreasing the value of $\alpha$ and increasing $\beta$ in Formula (1).
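The coefficient-based variant can be sketched as a ranking by a weighted score. This assumes that the mining score is a weighted sum of the classification and localization losses; the names `mining_score` and `select_hard_examples` are hypothetical, and the score is used for ranking only, never for the gradient.

```python
def mining_score(l_cls, l_loc, alpha, beta):
    """Weighted score used only to rank RoIs (assumed linear combination)."""
    return alpha * l_cls + beta * l_loc


def select_hard_examples(losses, budget, alpha, beta):
    """Return the indices of the `budget` RoIs with the highest mining score."""
    order = sorted(
        range(len(losses)),
        key=lambda i: mining_score(losses[i][0], losses[i][1], alpha, beta),
        reverse=True,
    )
    return order[:budget]


# Hypothetical per-RoI (l_cls, l_loc) pairs, for illustration only.
candidates = [(0.9, 0.1), (0.5, 0.4), (0.3, 0.2)]
start_of_training = select_hard_examples(candidates, budget=1, alpha=1.0, beta=0.0)
later_in_training = select_hard_examples(candidates, budget=1, alpha=1.0, beta=2.0)
```

With the localization coefficient at zero the RoI with the largest classification loss is chosen; once the localization coefficient is raised, the RoI with the larger localization loss overtakes it.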
S-OHEM does not have a significant influence on training time, because most of the forward computation is shared between RoIs girshick2015fast and the number of backpropagated examples is much smaller than the number of region proposals of the input images. To handle co-located RoIs and loss double counting, we follow the solution of shrivastava2016training and apply non-maximum suppression (NMS) neubeck2006efficient to perform deduplication before the sampling procedure. NMS works by finding the highest-loss RoI and eliminating all other RoIs that have lower loss and high overlap with the selected region. We also adopt their method of maintaining a read-only RoI network and a standard RoI network with shared weights for efficient memory allocation. It is also worth noting that S-OHEM can be combined with any of the post-recognition regressors introduced in Sect. 2, because it focuses on enhancing localization accuracy from the data perspective.
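The loss-based deduplication step can be sketched with a simple greedy NMS in which the per-RoI loss plays the role usually taken by the detection score. The box format `(x1, y1, x2, y2)`, the default overlap threshold of 0.7, and the function names are assumptions made for illustration.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def loss_nms(boxes, losses, iou_thresh=0.7):
    """Greedy NMS that keeps the highest-loss RoI among overlapping ones."""
    # Visit RoIs from highest to lowest loss.
    order = sorted(range(len(boxes)), key=lambda i: losses[i], reverse=True)
    keep = []
    for i in order:
        # Keep an RoI only if it does not overlap too much with a kept one.
        if all(box_iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping RoIs and one separate RoI (hypothetical values).
roi_boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
roi_losses = [0.5, 0.9, 0.2]
kept = loss_nms(roi_boxes, roi_losses)
```

Here the first two boxes overlap with an IoU of about 0.82, so only the one with the larger loss survives, together with the isolated third box.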
4 Experiments and Results
In this section, we conduct systematic experiments to evaluate the proposed S-OHEM and compare it with original OHEM. We describe the experimental setup in Sect. 4.1, and demonstrate the efficiency and accuracy of the algorithm by examining the training loss and average precision.
4.1 Experimental Setup
We use the standard and popular CNN architecture VGG16 from SimonyanZ14a , and evaluate the algorithms on the PASCAL VOC 2007 and KITTI Object Detection Evaluation 2012 datasets. In the PASCAL VOC experiment, training is done on the trainval set and testing on the test set. In the KITTI 2012 experiment, we use the first 5000 images to form the training set and the remaining 2481 images for testing. All models are trained with SGD for 80K mini-batch iterations under the setup described in this section. For average precision, we report results at IoU thresholds of 0.5, 0.6, and 0.7 to evaluate localization accuracy over a wider range of IoU thresholds. We use Fast R-CNN girshick2015fast as the base detector for our PASCAL VOC experiment and Faster R-CNN ren2015faster for the KITTI 2012 experiment, to demonstrate the applicability of our approach. The initial learning rate is set to 0.001 and dropped in "steps" by a factor of 0.1 every 30K iterations. We process 2 images in each mini-batch iteration and subsample 128 RoIs to feed into backpropagation. Note that the OHEM baselines reported in Table 2 (rows 1-2) were reproduced and are slightly higher than the ones reported in shrivastava2016training .
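The step learning-rate policy described above can be written as a one-line function. The function name and argument names are hypothetical; the numeric defaults are the values stated in the setup.

```python
def step_lr(iteration, base_lr=0.001, gamma=0.1, step_size=30000):
    """Learning rate after dropping by `gamma` every `step_size` iterations."""
    return base_lr * gamma ** (iteration // step_size)


# Learning rate at the start, after the first drop, and at the last iteration.
lr_start = step_lr(0)
lr_after_first_drop = step_lr(30000)
lr_final = step_lr(79999)
```

Over the 80K training iterations this schedule drops the rate twice, at 30K and at 60K.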
For both experiments, we follow the procedure described in Sect. 3.3 to control the contribution coefficients $\alpha$ and $\beta$ of $L_{cls}$ and $L_{loc}$. When training starts, $\alpha$ and $\beta$ are set to 1 and 0. Then we gradually increase $\beta$ to the ratio between the classification and localization losses once the loss becomes stable. Specifically, $\beta$ increases to 1.9 and 2.3 for the VOC07 and KITTI12 experiments respectively.
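One way to realize such a gradual increase is a ramp on the localization coefficient. The sketch below assumes a linear ramp that starts at 40K iterations (the point where the loss is described as stable) and ends at 80K; the text states only the start value, the target values (1.9 and 2.3), and that the change is gradual, so the ramp shape and the function name are assumptions.

```python
def loc_coefficient(iteration, target, stable_iter=40000, total_iters=80000):
    """Localization coefficient: 0 until the loss is stable, then a linear ramp."""
    if iteration <= stable_iter:
        return 0.0
    frac = min(1.0, (iteration - stable_iter) / (total_iters - stable_iter))
    return target * frac


# Targets taken from the text: 1.9 for VOC07 and 2.3 for KITTI12.
voc_mid = loc_coefficient(60000, target=1.9)
voc_end = loc_coefficient(80000, target=1.9)
kitti_end = loc_coefficient(80000, target=2.3)
```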
4.2 Results and Analysis
4.2.1 Training Convergence.
We firstly analyze the training loss for both methods by logging the average training loss every 10K steps. Figure 3 shows the average loss per RoI for VGG16 with settings presented in Sect. 4.1. We see that S-OHEM yields lower loss in both classification and localization than the original OHEM, validating our claims that S-OHEM leads to better training than OHEM. Also, the results indicate that S-OHEM is better in classification confidence and localization accuracy during the training process.
4.2.2 VOC 2007.
Table 1 shows that on VOC07, S-OHEM improves the mAP of OHEM from 71% to 71.1% at an IoU threshold of 0.5, and by 0.4% and 0.3% at IoU 0.6 and 0.7 respectively. Among category-specific improvements, S-OHEM performs well on most of the rigid categories (bold categories in Table 1) across all three IoU thresholds, especially at IoU 0.7.
As listed in Table 3(a), we compute the mAP over rigid categories and observe increases of 0.1%, 0.5%, and 0.5% for IoU 0.5, 0.6, and 0.7 respectively. It’s also interesting to find that S-OHEM performs quite well in detecting cats at an IoU threshold of 0.6, which indicates that S-OHEM generates better bounding boxes in this setting.
4.2.3 KITTI 2012.
The evaluation results on KITTI 2012 are shown in Table 2. S-OHEM improves the mAP of OHEM from 63.9% to 64.9% at an IoU threshold of 0.6, and by 0.5% at IoU 0.7. We also compute the mAP over rigid categories and list the results in Table 3(b). Note that the misc category is classified as rigid based on our observation of the dataset. We observe an increase of 1.6% for both IoU thresholds 0.6 and 0.7.
4.2.4 Rigid and Non-Rigid Category.
Our experimental results show that S-OHEM performs quite well on rigid categories of both the VOC07 and KITTI12 datasets. The reason is that rigid bodies can reach better classification accuracy on pre-trained deep convolutional networks owing to their strong resistance to deformation. Therefore, the influence of different loss distributions throughout the training process (as described in Sect. 3.1) is more likely to appear on rigid bodies. Also, the border distributions of rigid bodies are more similar to each other and thus easier to learn.
5 Conclusion
In this paper, we proposed the Stratified Online Hard Example Mining (S-OHEM) algorithm, a simple and effective method for training region-based deep convolutional network detectors to enhance localization accuracy. During hard example mining, S-OHEM exploits stratified sampling and focuses on the influence of different loss types throughout the training process. Experimental analysis shows that S-OHEM outperforms OHEM in training convergence and localization accuracy, and achieves AP improvements on rigid categories of PASCAL VOC 2007 and KITTI 2012. Moreover, S-OHEM addresses the problem of enhancing localization purely from the data perspective and can be easily plugged into existing region-based detectors. Furthermore, the state-of-the-art Mask R-CNN he2017mask also uses the equal-weight multi-task loss with an additional task of semantic segmentation, which could be improved through S-OHEM. S-OHEM can also be applied to other multi-task losses, including those for semantic segmentation, key-point detection, etc.
References
- (1) D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
- (2) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- (3) M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
- (4) P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645, 2010.
- (5) A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
- (6) S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware cnn model. In Proceedings of the IEEE International Conference on Computer Vision, pages 1134–1142, 2015.
- (7) S. Gidaris and N. Komodakis. Locnet: Improving localization accuracy for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 789–798, 2016.
- (8) R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
- (9) R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
- (10) I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
- (11) C. Gu, J. J. Lim, P. Arbeláez, and J. Malik. Recognition using regions. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1030–1037. IEEE, 2009.
- (12) H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, 2009.
- (13) K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. arXiv preprint arXiv:1703.06870, 2017.
- (14) K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer, 2014.
- (15) K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
- (16) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- (17) G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning lecture 6a overview of mini–batch gradient descent. 2012.
- (18) S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 448–456. JMLR.org, 2015.
- (19) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- (20) A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- (21) M. Li, D. Li, S. Shen, Z. Zhang, and X. Lu. Dss: A scalable and efficient stratified sampling algorithm for large-scale datasets. In IFIP International Conference on Network and Parallel Computing, pages 133–146. Springer, 2016.
- (22) Y. Li, K. He, J. Sun, et al. R-fcn: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
- (23) J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
- (24) V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
- (25) A. Neubeck and L. Van Gool. Efficient non-maximum suppression. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 3, pages 850–855. IEEE, 2006.
- (26) S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
- (27) A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
- (28) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
- (29) N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- (30) K. K. Sung. Learning and Example Selection for Object and Pattern Detection. PhD thesis, Cambridge, MA, USA, 1996. AAI0800657.
- (31) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
- (32) C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
- (33) S. K. Thompson. Stratified Sampling, pages 139–156. John Wiley & Sons, Inc., 2012.
- (34) B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. CoRR, abs/1505.00853, 2015.