C-RPNs: Promoting Object Detection in real world via a Cascade Structure of Region Proposal Networks

Recently, significant progress has been made in object detection on common benchmarks (e.g., Pascal VOC). However, object detection in the real world is still challenging due to serious data imbalance. Images in the real world are dominated by easy samples, such as wide stretches of background and some easily recognizable objects. Although two-stage detectors like Faster R-CNN have achieved great success in object detection thanks to the strategy of extracting region proposals with a region proposal network, they adapt poorly to real-world object detection because they do not mine hard samples while extracting region proposals. To address this issue, we propose a Cascade framework of Region Proposal Networks, referred to as C-RPNs. The essence of C-RPNs is adopting multiple stages to mine hard samples while extracting region proposals and to learn stronger classifiers. Meanwhile, a feature chain and a score chain are proposed to help learn more discriminative representations for proposals. Moreover, a loss function of cascade stages is designed to train the cascade classifiers through backpropagation. Our proposed method has been evaluated on Pascal VOC and several challenging datasets such as BSBDV 2017 and CityPersons. Our method achieves competitive results compared with the current state of the art and all-round improvements in error analysis, validating its efficacy for detection in the real world.



1 Introduction

Object detection is a fundamental step in visual understanding, which aims at identifying and localizing objects of certain categories in images. To promote the development of object detection, plenty of benchmarks have been developed, e.g., PASCAL VOC Everingham et al. (2010) and MS COCO Lin et al. (2014). Most object detection approaches are trained and tested on these common benchmarks, which typically assume that objects in images have good visibility and balance. This assumption is usually not satisfied in the real world.

Taking littoral bird images from established benchmarks and wild scenes as examples, the former are usually collected with better visibility, while the latter are collected via monitoring cameras with varying backgrounds and camera distances. Moreover, different illumination and weather conditions may appear in wild scenes. For more intuitive observation, several examples of littoral birds are illustrated in Figure 1. The image from BSBDV 2017 Guan et al. (2018) shows birds from wild scenes, while images from PASCAL VOC Everingham et al. (2010) show birds from common benchmarks. The image from BSBDV 2017 has a resolution of 4912*3264, in which the heights of birds vary from 80 to 300 pixels. Images from PASCAL VOC 2007 and 2012 have an average resolution of 400*400, where the heights of birds range from 150 to 480 pixels. Apparently, the easily recognizable background in wild scenes takes a more prominent position than that in common benchmarks. Besides, bird objects obtained from wild scenes are smaller and carry less texture information. For object detection techniques, such a distribution mismatch between common benchmarks and the real world has been observed to lead to significant performance degradation.

Although enriching training data could alleviate the performance degradation, it is not favored since annotating data is expensive and time-consuming. Therefore, developing object detectors tailored to the real world is desirable. To identify the crucial causes of performance degradation in real-world object detection, we conducted a large number of experiments. We list the conclusions as follows:

  1. Data imbalance frequently occurs in the real world. In a real-world image, the number of negative samples (also called background samples) is much larger than that of positive samples (as shown in Figure 1), and most of them are easy samples. Easy samples do not contribute useful learning information during training, while hard samples benefit convergence and detection accuracy. Thus, the overwhelming number of easy samples during training leads to weak classifiers and degenerate models.

  2. As mentioned above, because of the smaller size, poor shooting conditions and low abundance of objects in real-world scenes, classifiers in detection algorithms are unable to learn discriminative features from the ground truth.

In this work, we aim to improve the precision of object detection in the real world. Based on the observations above, mining hard samples from abundant easy samples for training is a crucial route to address this issue. Building on the Faster R-CNN object detector Ren et al. (2017), we propose a cascade framework of region proposal networks, referred to as C-RPNs. While extracting region proposals, C-RPNs are adopted to mine hard samples and learn stronger classifiers. Classifiers at early stages discard most of the easy samples so that classifiers at later stages focus on handling hard samples. Also, we design a feature chain and a score chain to generate more discriminative representations for proposals. Finally, a loss function of cascade stages is built to jointly learn the cascade classifiers.

The contributions of this work are summarized as follows:

  • Based on Faster R-CNN, a cascade structure of region proposal networks for object detection, referred to as C-RPNs, is proposed for the first time.

  • A feature chain and a score chain are designed in C-RPNs to further improve the classification capacity of the multi-stage classifiers.

  • A multi-stage loss function is constructed to jointly learn the cascade classifiers.

  • With the proposed components integrated into Faster R-CNN, the resulting model can be trained end-to-end.

Extensive experiments have been conducted on several datasets, including PASCAL VOC Everingham et al. (2010), BSBDV 2017 Guan et al. (2018), Caltech Pedestrian Benchmark Dollar et al. (2009) and CityPersons Zhang et al. (2017). Our approach provides competitive performance compared with the current state of the art. Besides, error analyses show that our approach achieves all-round improvements over the baseline Faster R-CNN. The experimental results demonstrate the effectiveness of our proposed approach for object detection in the real world.

Figure 1: Examples: (1) 12 images from Pascal VOC (upper left); (2) one littoral bird image from BSBDV 2017 (upper middle); (3) the bird objects drawn from these images (bottom); (4) easy and hard samples listed randomly from the real-world image (right). It is clear that the image from realistic scenes is dominated by easy samples, especially easy negative samples. Besides, the scales and abundance of birds are mismatched between common benchmarks and realistic scenes. Best viewed in color.

2 Related work

2.1 Related Work On Object Detection

We have all witnessed tremendous progress in object detection using convolutional neural networks (CNNs) in recent years Girshick et al. (2014); Girshick (2015); Ren et al. (2017); Dai et al. (2016); Gidaris and Komodakis (2015); Liu et al. (2016); Redmon et al. (2016). Region-based CNN approaches Girshick et al. (2014); Girshick (2015); Ren et al. (2017) are referred to as two-stage detectors, which have received great attention due to their effectiveness. At the outset, R-CNN Girshick et al. (2014) was constrained by its reliance on selective search for region proposals. To reduce the computational complexity of R-CNN, Fast R-CNN Girshick (2015) shared the convolutional feature maps among regions of interest (RoIs) and accelerated spatial pyramid pooling by using an RoI pooling layer. Ren et al. (2017) introduced the Region Proposal Network (RPN) to generate high-quality region proposals and then merged it with Fast R-CNN into a single network, referred to as Faster R-CNN. Besides, for faster detection, one-stage detectors such as YOLO Redmon et al. (2016) and SSD Liu et al. (2016) were proposed to accomplish detection without region proposals, although this strategy reduced the detection performance. Research showed that Faster R-CNN achieved great success in object detection and laid the foundation for many follow-up works Gidaris and Komodakis (2015); Lin et al. (2017); Zhang et al. (2016); Dai et al. (2017). For example, feature pyramids and fusion operations were adopted in Lin et al. (2017) to enhance detection precision. Deeper Simonyan and Zisserman (2014); He et al. (2016); Szegedy et al. (2015) or wider Bell et al. (2016); Zagoruyko and Komodakis (2016) networks also benefited the performance. Deformable CNN Dai et al. (2017) and Receptive Field Block Net Liu et al. (2018a) enhanced the convolutional features using the deformable convolution operation and the Receptive Field Block, respectively. In addition, using a large batch size Peng et al. (2018) during training improved detection. SIN Liu et al. (2018b) jointly used scene context and object relationships to promote detection performance.

Although reasonable detection performance has been achieved on benchmarks like PASCAL VOC Everingham et al. (2010) and MS COCO Lin et al. (2014), object detection in the real world still suffers from poor precision. The works mentioned above mostly focused on the conventional setting and rarely considered the adaptation issues of real-world object detection, such as data imbalance.

2.2 Related Work On Hard Example Mining and Cascade CNN

Bootstrapping Sung (1996), which gradually updates the set of background samples by selecting samples that are detected as false positives, was the earliest solution for automatically employing hard samples in training. The strategy in bootstrapping led to an iterative process that alternates between updating the trained model and finding new false positives to add to the bootstrapped training set. Bootstrapping techniques were then successfully applied to detectors driven by CNNs and SVMs for object detection Girshick et al. (2014); He et al. (2014), generally referred to as hard negative mining. After that, CNN detectors like Fast R-CNN Girshick (2015) and its descendants were trained with stochastic gradient descent (SGD) on millions of samples, and bootstrapping, being an offline process, was no longer adopted. To balance positive and negative training samples, but without mining hard ones, Faster R-CNN Ren et al. (2017) randomly used 256 samples in an image to compute the loss function of a mini-batch, where the positive and negative ones have a ratio of up to 1:1. A number of methods Simoserra et al. (2015); Loshchilov and Hutter (2015); Shrivastava et al. (2016) then focused on mining hard samples online for training convolutional networks. Simoserra et al. (2015) selected hard positive and negative samples independently from a larger set of random samples based on their loss. Loshchilov and Hutter (2015) and Shrivastava et al. (2016) focused on online hard sample selection strategies for mini-batch SGD methods, and OHEM Shrivastava et al. (2016) was then introduced for region-based detectors, building mini-batches from the highest-loss samples. Recently, Focal Loss Lin et al. (2018) has been proposed to address the extreme foreground-background class imbalance problem in object detection with one-stage detectors; it applies a modulating term to the cross entropy loss in order to focus learning on hard negative examples. An analysis of previous works shows that early bootstrapping techniques are inappropriate for CNN-based detectors. Some online hard example mining strategies select hard examples based on their loss, which is innovative but time-consuming. Focal Loss deals with data imbalance in one-stage detectors, while our work pays more attention to two-stage detectors with region proposals.
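As a minimal sketch of the highest-loss selection idea attributed to OHEM above, the Python snippet below picks the samples with the largest per-sample loss to form a mini-batch. The function name `select_hard_examples` and the toy loss values are hypothetical, and the sketch omits every other detail of the cited methods.

```python
def select_hard_examples(losses, batch_size):
    """Return the indices of the `batch_size` samples with the highest loss."""
    # Sort sample indices by loss, largest first (ties keep their original order).
    ranked = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return ranked[:batch_size]


# Hypothetical per-sample losses for five candidate samples.
toy_losses = [0.10, 2.00, 0.05, 1.50, 0.30]
hard_indices = select_hard_examples(toy_losses, batch_size=2)
print(hard_indices)
```

In contrast, the random sampling described for Faster R-CNN would draw indices without looking at the loss at all, so easy samples would fill most of the mini-batch when they dominate the image.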

From another perspective, the cascade structure is a widely used technique that discards easy samples at early stages in order to learn better classification models. Before the prosperity of CNNs, cascade structures were applied to SVMs Felzenszwalb et al. (2010) and boosted classifiers Dollár et al. (2014); Xiao et al. (2003) with hand-crafted features. Multi-stage classifiers have been proved to be effective in generic object detection Felzenszwalb et al. (2010) and face detection Xiao et al. (2003); Bourdev and Brandt (2005), although these multiple classifiers were not trained jointly. It was shown that CNNs with a cascade structure performed effectively on classification Ouyang et al. (2015); Li et al. (2015); Yang et al. (2016) as well, in which multiple but separate CNNs were trained. After that, Qin et al. (2016) proposed a method to jointly train cascade CNNs. The recent Cascade R-CNN Cai and Vasconcelos (2018) trained Faster R-CNN with a cascade of increasing IoU thresholds, which was innovative but did not consider the data imbalance issue. Based on the observations above, cascade structures are promising, but existing works either cannot be integrated into the R-CNN based detection framework or have not considered building a cascade structure on the RPN to help extract hard region proposals. Thus, existing results on the data imbalance problem in real-world object detection are very limited.

In this work, we propose C-RPNs to mine hard samples while extracting region proposals and to learn more discriminative features for object detection in the real world. Integrated with the Faster R-CNN model, our proposed method is, to the best of our knowledge, the first cascade model of region proposal networks for object detection.

3 Proposed Method

3.1 Overview Of C-RPNs

Faster R-CNN consists of a shared backbone convolutional network, a region proposal network (RPN) and a final classifier based on regions of interest (RoIs), in which the RPN is employed to extract region proposals. Because the RPN does not mine hard samples, Faster R-CNN shows limited detection capacity in realistic scenes. We propose C-RPNs to address this problem.

Figure 2: An overview of our proposed C-RPNs model. We adopt VGG16 as the backbone network. ES refers to easy samples.

For a fair performance comparison, VGG16 Simonyan and Zisserman (2014) is taken as the backbone network. Figure 2 shows an overview of our proposed C-RPNs model. At first, several shared bottom convolutional layers (Conv1-Conv4_1) are used for extracting convolutional features from the image. Then, C-RPNs is applied to four different convolutional layers: Conv4_2, Conv4_3, Conv5_2 and Conv5_3. Since feature maps from Conv5 have the same number of channels but half the size of those from Conv4, we apply 2*2 average pooling to Conv4_2 and Conv4_3 to obtain feature maps of the same resolution for these four stages. At stage 1, the feature map extracted from Conv4_2 is used for generating region proposals and obtaining binary classification scores through a softmax function. The binary classification scores estimate a sample’s probabilities of belonging to background and objects. With the classification scores and a reject threshold r, part of the easy samples are rejected at this stage, as detailed in Section 3.3. At stage 2, if a proposal has not been rejected at the former stage, the feature map from Conv4_3 for this proposal is used for further binary classification. Similar processes are applied at stage 3 and stage 4. Since there is no constraint that the rejected samples must be background, a few easy positive samples might also be rejected at early stages during training. It is worth pointing out that stage 4 performs not only binary classification but also bounding box regression. After these four stages, the proposals that have not been rejected are sent to the RoI pooling layer for final detection. In this study, we set the batch sizes of the four stages to 1024, 768, 512 and 256, respectively, so that stage 4 has the same batch size as the RPN in Faster R-CNN. We use only 4 stages rather than 5 or more because, according to our experiments, employing the shallower and larger feature maps from Conv3 contributes very limited performance gain but is time-consuming.
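The resolution-matching step described above (2*2 average pooling on the Conv4 feature maps) can be illustrated with a small pure-Python sketch. It operates on a single toy channel stored as a list of rows; the function name `avg_pool_2x2`, the stride of 2 and the toy values are assumptions for illustration and are not taken from the authors' implementation.

```python
def avg_pool_2x2(feature_map):
    """Average-pool a 2D map (list of rows) with a 2x2 window and stride 2."""
    rows, cols = len(feature_map) // 2, len(feature_map[0]) // 2
    pooled = []
    for i in range(rows):
        pooled_row = []
        for j in range(cols):
            # Sum the four values covered by the window, then average them.
            window_sum = (feature_map[2 * i][2 * j] + feature_map[2 * i][2 * j + 1]
                          + feature_map[2 * i + 1][2 * j] + feature_map[2 * i + 1][2 * j + 1])
            pooled_row.append(window_sum / 4.0)
        pooled.append(pooled_row)
    return pooled


# Toy 4x4 map standing in for one channel of a Conv4 feature map.
conv4_channel = [[1, 3, 5, 7],
                 [1, 3, 5, 7],
                 [2, 2, 4, 4],
                 [6, 6, 8, 8]]
pooled = avg_pool_2x2(conv4_channel)
print(pooled)
```

The pooled map has half the height and width of the input, which is the relation between Conv4 and Conv5 feature maps stated in the text.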

From Figure 2, it can be seen that C-RPNs takes different convolutional features stage by stage, which enables it to obtain different semantic information and receptive fields. In C-RPNs, the classifiers at shallow stages handle easier samples so that the classifiers at deeper stages focus on handling more difficult samples. The easy samples rejected by a classifier at a shallow stage do not participate in the later stages. With this design, abundant samples can be used, but only the mined hard samples go on to final classification and bounding box regression, which helps to alleviate the data imbalance problem.
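The stage-by-stage rejection described above can be sketched in a few lines of Python. The exact rejection test is not spelled out in this section, so the sketch assumes that a sample is rejected at a stage when its most confident class probability reaches the threshold r (which is consistent with the remark that easy positives may also be rejected). The function name `cascade_filter` and the toy scores are hypothetical.

```python
def cascade_filter(stage_scores, r=0.99):
    """Track which samples survive a cascade of binary classifiers.

    stage_scores: one list per stage, each holding a (p_background, p_object)
    pair per sample. Returns (entering, alive), where entering[t] marks the
    samples that reach stage t and alive marks the final survivors.
    """
    alive = [True] * len(stage_scores[0])
    entering = []
    for scores in stage_scores:
        entering.append(list(alive))
        for i, (p_background, p_object) in enumerate(scores):
            # ASSUMPTION: a confidently classified sample counts as "easy".
            if alive[i] and max(p_background, p_object) >= r:
                alive[i] = False
    return entering, alive


# Two toy stages with four samples each.
toy_scores = [
    [(0.999, 0.001), (0.60, 0.40), (0.002, 0.998), (0.50, 0.50)],
    [(0.50, 0.50), (0.995, 0.005), (0.50, 0.50), (0.30, 0.70)],
]
entering, survivors = cascade_filter(toy_scores)
print(entering[1], survivors)
```

In this toy run, samples 0 and 2 are dropped at the first stage, sample 1 at the second, and only sample 3 would be passed on as a hard sample.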

To further enhance the classification capacity, a feature chain and a score chain are designed in C-RPNs, which are detailed in Section 3.2. In the end, the multi-stage classifications and bounding box regressions are learned in an end-to-end manner through backpropagation via a joint loss function; details are given in Section 3.3.

3.2 Feature Chain and Score Chain

Previous studies show that FPN Lin et al. (2017) and DSSD Fu et al. (2017), which use multiple convolutional layers, are effective for object detection. In this study, in order to capture the variation of features across layers, a feature chain and a score chain are designed over the cascade stages, which use the features at previous stages as prior knowledge for the classification at the current stage. Unlike the top-down pathway and lateral connections of FPN, our feature fusion operation follows the bottom-up pathway, which is the feed-forward computation of VGG16. The feature chain and score chain are illustrated in Figure 3.

Figure 3: The proposed feature chain and score chain of C-RPNs.

We define the number of stages as $T$ and let $t$ be the stage index. At stage $t$, we denote the features from the convolutional layer as $f_t$ and the features used for classification as $h_t$. The feature chain is formulated as follows:

$$h_t = a \cdot h_{t-1} \oplus b \cdot f_t$$

where $\oplus$ denotes point-to-point summation. $a$ and $b$ are hyperparameters controlling the weights of the features from the former stage and from the present convolutional layer when generating the fused features for classification; $a$ and $b$ add up to 1. Considering that features from the present convolutional layer are more helpful for classification, we set $a$ to 0.1 and $b$ to 0.9 according to our empirical tests (detailed in Section 4.5). The fused features are then used for classification.
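A minimal Python sketch of this fusion is given below, under the assumption that the fused features of a stage are a point-wise weighted sum of the fused features of the former stage (weight 0.1) and the features of the present convolutional layer (weight 0.9), and that the first stage, which has no predecessor, uses its convolutional features directly. Features are flattened to plain lists for brevity; `feature_chain` and the toy values are hypothetical.

```python
def feature_chain(conv_features, a=0.1, b=0.9):
    """Fuse per-stage features with a running weighted point-wise sum."""
    fused, previous = [], None
    for current in conv_features:
        if previous is None:
            # ASSUMPTION: stage 1 has no former stage, so it is used as is.
            fused_t = list(current)
        else:
            # Weight a for the former stage, weight b for the present layer.
            fused_t = [a * p + b * c for p, c in zip(previous, current)]
        fused.append(fused_t)
        previous = fused_t
    return fused


# Four toy stages, each with a two-element feature vector.
stage_features = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
fused = feature_chain(stage_features)
print(fused[-1])
```

With the 0.1/0.9 weights, the present layer dominates each fused result while a small, geometrically decaying trace of the earlier stages is retained.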

At stage $t$, for each proposal that has not been rejected at stage $t-1$, the score chain combines the score from classifier $t$ with the scores carried over from the previous stages to form the output score of this stage.

In this implementation, features and scores at the current stage make use of those from previous stages, which enhances the capacity of the classifiers at the current stage.

3.3 Cascade Loss Function with Samples Mining

Figure 4: Cascade losses of our proposed C-RPNs. Faster R-CNN Ren et al. (2017) is displayed as the baseline for comparison.

In Faster R-CNN, training loss is composed of loss from RPN and ROIs. The former contains a binary classification loss and a regression loss. In our method, illustrated in Figure 4, C-RPNs contains four binary classification losses and a regression loss.

In C-RPNs, the cascade classifiers assign each sample probabilities of being background or object. We use $k \in \{0, 1\}$ to denote these two classes, respectively. At stages $t = 1, \dots, T$, the set of class scores for a sample is denoted by $p = \{p_1, \dots, p_T\}$, where $p_t = (p_{t,0}, p_{t,1})$ are the scores at stage $t$ for background and objects, respectively. Another layer at stage 4 outputs bounding box regression offsets $l$ for objects. Our proposed loss function of C-RPNs has the following formulation:

$$L_{\text{C-RPNs}} = L_{cls} + L_{loc}$$

where $L_{cls}$ is the loss for classification and $L_{loc}$ is the loss for bounding box regression. For $L_{loc}$, we use the smooth $L_1$ loss Girshick (2015). $L_{cls}$ is defined as follows, where $\lambda_t$ is a parameter that controls the weight of the loss from each cascade classifier and $u_t$ evaluates whether the sample has been rejected at previous stages:

$$L_{cls} = -\sum_{t=1}^{T} \lambda_t \, u_t \, \log p_{t,k}$$

Here, we set $\lambda_t = 10^{t-T}$, where $T=4$ in C-RPNs. Since scores from deeper classifiers are more crucial for final classification than those from shallow classifiers, $\lambda_t$ of deeper classifiers is given more weight, with a tenfold increase per stage based on our experience. For $u_t$, we set $r$ as a threshold value at each stage: $u_t$ is built from indicator terms $[\cdot]$ that compare the sample's score at each previous stage with $r$, and each $[\cdot]$ outputs 1 if its condition is true and 0 if it is false. If a sample has been rejected at previous stages, it is no longer used for training the classifier at the current stage. We set $r$ to 0.99 according to our empirical tests (detailed in Section 4.5). If $\lambda_t = 1$ and $u_t = 1$, then $L_{cls}$ is a normal cross entropy loss.
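To make the roles of the stage weight and the rejection indicator concrete, the following Python sketch computes a cascade classification loss for a single sample. It rests on two assumptions that this section does not state explicitly: the stage weight grows tenfold per stage and equals 1 at the last stage, and a sample counts as rejected once its most confident class probability reaches r. The name `cascade_cls_loss` and the toy probabilities are hypothetical.

```python
import math


def cascade_cls_loss(stage_probs, label, r=0.99):
    """Weighted sum of per-stage cross entropy terms for one sample.

    stage_probs: list of (p_background, p_object) pairs, one per stage.
    label: 0 for background, 1 for object.
    """
    num_stages = len(stage_probs)
    not_rejected = 1.0  # stays 1 while the sample survives all earlier stages
    loss = 0.0
    for t, probs in enumerate(stage_probs, start=1):
        weight = 10.0 ** (t - num_stages)  # ASSUMPTION: tenfold per stage, 1 at last stage
        loss += -weight * not_rejected * math.log(probs[label])
        # ASSUMPTION: a confidently classified sample is rejected for later stages.
        if max(probs) >= r:
            not_rejected = 0.0
    return loss


hard_sample = [(0.5, 0.5)] * 4
easy_background = [(0.995, 0.005), (0.5, 0.5), (0.5, 0.5), (0.5, 0.5)]
hard_loss = cascade_cls_loss(hard_sample, label=1)
easy_loss = cascade_cls_loss(easy_background, label=0)
print(hard_loss, easy_loss)
```

The easy background sample contributes only its small first-stage term, while the hard sample is penalized at every stage, most heavily at the last one.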

For object detection with the proposed model, the final training loss is composed of the loss from C-RPNs and the loss from RoIs:

$$L = L_{\text{C-RPNs}} + L_{\text{RoIs}}$$

where $L_{\text{C-RPNs}}$ and $L_{\text{RoIs}}$ are both composed of a classification loss and a regression loss. The former contains four cascade binary classification losses while the latter contains a multi-class classification loss. With this loss function, multiple classifiers and bounding box regressions are learned jointly through backpropagation.

4 Experiments and Evaluations

4.1 Experimental setup

Datasets and Evaluation Metrics

We evaluated our approach on several public object detection datasets, including PASCAL VOC Everingham et al. (2010), BSBDV 2017 Guan et al. (2018), Caltech Pedestrian Benchmark Dollar et al. (2009) and CityPersons Zhang et al. (2017). For evaluation, we used the standard average precision (AP) and mean average precision (mAP) scores with an IoU threshold of 0.5.
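For readers unfamiliar with the IoU criterion used in these metrics, a short Python sketch is given here. It computes the intersection over union of two axis-aligned boxes in (x1, y1, x2, y2) form and checks it against the 0.5 threshold. The helper names and the choice to treat the boundary value as a match are assumptions of this sketch.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def is_match(detection, ground_truth, threshold=0.5):
    """True if the detection overlaps the ground truth enough to count as a hit."""
    # ASSUMPTION: an IoU exactly equal to the threshold counts as a match.
    return iou(detection, ground_truth) >= threshold


print(iou((0, 0, 10, 10), (5, 0, 15, 10)))       # half-shifted box
print(is_match((0, 0, 10, 10), (0, 0, 10, 8)))   # slightly shorter box
```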

Pascal VOC. Pascal VOC involves 20 categories. VOC 2007 dataset consists of about 5k trainval images and 5k test images, while VOC 2012 dataset includes about 11k trainval images and 11k test images. Following the protocol in Girshick (2015), we perform training on the union of VOC 2007 trainval and VOC 2012 trainval. The test is conducted on VOC 2007 test set.

BSBDV2017. The Birds Dataset of Shenzhen Bay in Distant View Guan et al. (2018) is a very challenging dataset of wild scenes, consisting of 1,421 trainval images and 351 test images. BSBDV2017 contains three image resolutions: 2736*1824, 4288*2848 and 5472*3648. The size of birds varies greatly, from 18*30 to 1274*632 pixels.

Caltech Pedestrian Benchmark. The Caltech Pedestrian Benchmark Dollar et al. (2009) includes a total of 350,000 bounding boxes of pedestrians. Approximately 2,300 unique pedestrians were annotated in roughly 250,000 frames. Following the protocol in Yang et al. (2018), one frame from every five frames of Caltech Benchmark and all frames of the ETH Ess et al. (2008) and TUD-Brussels Wojek et al. (2009) are extracted as training data, which includes 27,021 images in total. 4,024 images in the standard test set are used for evaluation.

CityPersons. The CityPersons Zhang et al. (2017) consists of images recorded across 27 cities, 3 seasons, various weather conditions and more common crowds. It creates high quality bounding box annotations for pedestrians in 5000 images, which is a subset of the Cityscapes dataset Cordts et al. (2016). 2975 images from train set and 500 images from val set are used for training and testing respectively.

Implementation Details

Faster R-CNN is taken as our baseline, where all parameters are set according to the original publication Ren et al. (2017) unless otherwise specified. We initialize the backbone network using a VGG16 model pre-trained on ImageNet Deng et al. (2009), while all new layers are initialized by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. For training on Pascal VOC, we use a learning rate of 0.001 for 80k iterations and 0.0001 for 30k iterations. For training on the other datasets, we use a learning rate of 0.001 for 50k iterations and 0.0001 for 20k iterations. We train our model end-to-end with Stochastic Gradient Descent (SGD), where the momentum is 0.9 and the weight decay is 0.0005. Our program is implemented in TensorFlow Abadi et al. (2015) and run on a GeForce GTX TITAN X GPU.
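The two-step learning rate schedule described above can be written as a small Python function. This is only an illustrative sketch: the function name `learning_rate` and its parameters are hypothetical, and it is not tied to any particular TensorFlow API.

```python
def learning_rate(iteration, high_iters=80_000, low_iters=30_000):
    """Step schedule: 0.001 for the first phase, then 0.0001 for the second."""
    if iteration < 0 or iteration >= high_iters + low_iters:
        raise ValueError("iteration is outside the training schedule")
    return 0.001 if iteration < high_iters else 0.0001


# Pascal VOC setting: 80k iterations at 0.001, then 30k at 0.0001.
print(learning_rate(0), learning_rate(80_000))
# Setting for the other datasets: 50k iterations at 0.001, then 20k at 0.0001.
print(learning_rate(49_999, 50_000, 20_000), learning_rate(50_000, 50_000, 20_000))
```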

4.2 Overall Performance

Performance on Pascal VOC benchmark

We compare our approach with several state-of-the-art methods in this subsection. Results in terms of mean average precision (mAP) are shown in Table 1. Our model achieves the second best performance among all methods, which is 1.2% lower than that of RON Kong et al. (2017) but 3.2% higher than that of the baseline Faster R-CNN. Notably, our method also outperforms ION Bell et al. (2016), which uses the same backbone network and exploits features from Conv3_3, Conv4_3 and Conv5_3 to leverage context and multi-scale knowledge for object detection. From the table, we can see that although C-RPNs is designed to improve detection in the real world with imbalanced data, it achieves competitive performance on the Pascal VOC benchmark.

| Method | Trainset | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fast R-CNN Girshick (2015) | 07+12 | 70.0 | 77.0 | 78.1 | 69.3 | 59.4 | 38.3 | 81.6 | 78.6 | 86.7 | 42.8 | 78.8 | 68.9 | 84.7 | 82.0 | 76.6 | 69.9 | 31.8 | 70.1 | 74.8 | 80.4 | 70.4 |
| Faster R-CNN Ren et al. (2017) | 07+12 | 73.2 | 76.5 | 79.0 | 70.9 | 65.5 | 52.1 | 83.1 | 84.7 | 86.4 | 52.0 | 81.9 | 65.7 | 84.8 | 84.6 | 77.5 | 76.7 | 38.8 | 73.6 | 73.9 | 83.0 | 72.6 |
| SSD500 Liu et al. (2016) | 07+12 | 75.1 | 79.8 | 79.5 | 74.5 | 63.4 | 51.9 | 84.9 | 85.6 | 87.2 | 56.6 | 80.1 | 70.0 | 85.4 | 84.9 | 80.9 | 78.2 | 49.0 | 78.4 | 72.4 | 84.6 | 75.5 |
| ION Bell et al. (2016) | 07+12 | 75.6 | 79.2 | 83.1 | 77.6 | 65.6 | 54.9 | 85.4 | 85.1 | 87.0 | 54.4 | 80.6 | 73.8 | 85.3 | 82.2 | 82.2 | 74.4 | 47.1 | 75.8 | 72.7 | 84.2 | 80.4 |
| RON Kong et al. (2017) | 07+12 | 77.6 | 86.0 | 82.5 | 76.9 | 69.1 | 59.2 | 86.2 | 85.5 | 87.2 | 59.9 | 81.4 | 73.3 | 85.9 | 86.8 | 82.2 | 79.6 | 52.4 | 78.2 | 76.0 | 86.2 | 78.0 |
| SIN Liu et al. (2018b) | 07+12 | 76.0 | 77.5 | 80.1 | 75.0 | 67.1 | 62.2 | 83.2 | 86.9 | 88.6 | 57.7 | 84.5 | 70.5 | 86.6 | 85.6 | 77.7 | 78.3 | 46.6 | 77.6 | 74.7 | 82.3 | 77.1 |
| C-RPNs (ours) | 07+12 | 76.4 | 78.6 | 79.5 | 76.3 | 66.5 | 63.2 | 84.6 | 87.8 | 87.8 | 60.2 | 83.3 | 71.7 | 85.5 | 86.1 | 81.4 | 79.2 | 49.2 | 75.2 | 73.9 | 83.1 | 75.7 |

Table 1: Results on PASCAL VOC 2007 test set (AP in %). 07+12: union of Pascal VOC07 trainval and VOC12 trainval.
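As a quick consistency check on Table 1, the mAP of C-RPNs can be recomputed as the mean of its 20 per-class AP values, and the reported gain over Faster R-CNN follows from the two mAP entries. The Python sketch below copies the numbers from the table; the variable names are illustrative only.

```python
# Per-class AP values of the C-RPNs row in Table 1 (aero ... tv).
c_rpns_ap = [78.6, 79.5, 76.3, 66.5, 63.2, 84.6, 87.8, 87.8, 60.2, 83.3,
             71.7, 85.5, 86.1, 81.4, 79.2, 49.2, 75.2, 73.9, 83.1, 75.7]

# mAP is the unweighted mean of the per-class APs.
c_rpns_map = round(sum(c_rpns_ap) / len(c_rpns_ap), 1)

# Gain over the Faster R-CNN mAP reported in the same table.
faster_rcnn_map = 73.2
gain = round(c_rpns_map - faster_rcnn_map, 1)
print(c_rpns_map, gain)
```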

Performance on BSBDV 2017

Table 2 compares C-RPNs with state-of-the-art methods on BSBDV 2017. As shown in Table 2, our method achieves the best performance, and its average precision (AP) is 3.4% higher than the second best (FPN Lin et al. (2017)). More specifically, the AP of C-RPNs is 70.3%, an 11% performance gain over Faster R-CNN. Although C-RPNs obtains a slightly lower mAP than RON Kong et al. (2017) on VOC 2007, it outperforms RON by a margin of 12.3% on BSBDV 2017. Also, the AP of C-RPNs is 8.8% and 3.4% higher than that of R-FCN Dai et al. (2016) and FPN Lin et al. (2017), respectively. These results demonstrate that C-RPNs is more competitive for object detection in the real world.

| Method | Backbone Network | AP (%) |
|---|---|---|
| SSD500 Liu et al. (2016) | VGG16 reduce | 42.0 |
| Faster R-CNN Ren et al. (2017) | VGG16 | 59.3 |
| RON Kong et al. (2017) | ResNet-101 | 58.0 |
| R-FCN Dai et al. (2016) | ResNet-50 | 61.5 |
| FPN Lin et al. (2017) | ResNet-50 | 66.9 |
| SIN Liu et al. (2018b) | VGG16 | 58.4 |
| C-RPNs (ours) | VGG16 | 70.3 |

Table 2: Performance Comparison on BSBDV 2017.

Comparison with baseline Faster R-CNN on pedestrian datasets

Pedestrian datasets like the Caltech pedestrian benchmark Dollar et al. (2009) and CityPersons Zhang et al. (2017), which are collected via monitoring cameras in realistic street scenes, are more challenging than Pascal VOC. Performance on these two datasets helps to verify the effectiveness of our approach, since the scales and occlusion of pedestrians change frequently. Table 3 compares our C-RPNs with the baseline Faster R-CNN on these pedestrian datasets. Our C-RPNs achieves average precision of 48.1% and 51.4% on the Caltech pedestrian benchmark and CityPersons, a gain of 4.1% and 2.3% over the baseline Faster R-CNN, respectively, which indicates its robustness in intricate realistic scenes.

| Method | Caltech pedestrian benchmark, AP (%) | CityPersons, AP (%) |
|---|---|---|
| Faster R-CNN Ren et al. (2017) | 44.0 | 49.1 |
| C-RPNs (ours) | 48.1 | 51.4 |

Table 3: Performance Comparison on Caltech Pedestrian Benchmark and CityPersons.

4.3 Qualitative Examples

Qualitative Examples on wild bird detection

For visualization purposes, several examples of detection results on BSBDV 2017 are given in Figure 5. The top and bottom rows show the results of Faster R-CNN and C-RPNs, respectively. Detection boxes are marked in red. For better observation, we mark missed detections in yellow. According to the ground truth, there are 46 and 22 birds in the top and bottom images, respectively. From the results, we can see that our method shows significantly improved recall for object detection in wild scenes, where 40 and 17 birds have been detected, respectively. Compared with Faster R-CNN, our method detects 16 and 2 more birds in the two images, respectively. Meanwhile, dotted blue boxes mark samples that are detected with more than one box, three in the top images and none in the bottom images, which indicates that our method is able to generate more precise bounding boxes.
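The recall implied by these counts can be worked out directly. The short Python sketch below uses only the numbers quoted above; the Faster R-CNN counts (24 and 15) are derived here by subtracting the stated differences of 16 and 2 and are not reported explicitly in the text.

```python
def recall(detected, total):
    """Fraction of ground-truth objects that were detected."""
    return detected / total


ground_truth = (46, 22)   # birds in the two example images
c_rpns = (40, 17)         # birds detected by C-RPNs
# Derived: C-RPNs finds 16 and 2 more birds than Faster R-CNN.
faster_rcnn = (c_rpns[0] - 16, c_rpns[1] - 2)

c_rpns_recall = [round(recall(d, t), 3) for d, t in zip(c_rpns, ground_truth)]
faster_recall = [round(recall(d, t), 3) for d, t in zip(faster_rcnn, ground_truth)]
print(c_rpns_recall, faster_recall)
```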

Figure 5: Detection results of Faster R-CNN (row 1) and our proposed C-RPNs (row 2) on BSBDV 2017. Best viewed in color.

Qualitative Examples on practical pedestrian detection

In Figure 6, our proposed approach is trained on the Caltech Pedestrian Benchmark and tested in realistic environments with random pedestrian flows. We show some detection images taken from different shooting angles, such as looking down and looking up, or under poor illumination, which were collected in a subway, a park and a campus. Compared with the results from Faster R-CNN, our method yields more true positive and fewer false positive detections in these images. Some hard samples that Faster R-CNN falsely classifies as background are correctly detected as pedestrians by C-RPNs. According to the results in Figure 6, our proposed C-RPNs can adapt to harsh and complex environments to provide high-quality object detection in the real world.

Figure 6: Detection results of Faster R-CNN (row 1 and row 3) and our proposed C-RPNs (row 2 and row 4) on realistic pedestrian images.

4.4 Improvement analysis on false detections

To further examine the improvement of our C-RPNs over the baseline Faster R-CNN, the analysis tools of Hoiem et al. (2012) are applied to Pascal VOC to produce a detailed error analysis. In Pascal VOC, the animal categories include ‘bird’, ‘cat’, ‘cow’, ‘dog’, ‘horse’, ‘sheep’ and ‘person’. ‘Plane’, ‘bicycle’, ‘boat’, ‘bus’, ‘car’, ‘motorbike’ and ‘train’ make up the vehicle categories. Figure 7 takes animals and vehicles as examples to show the frequency of each type of false positive and its impact on performance. As shown, C-RPNs reduces detection errors compared with Faster R-CNN when detecting both animals and vehicles. C-RPNs has fewer BG errors as well as fewer Loc errors than the baseline, indicating that C-RPNs classifies and localizes objects better because it mines hard samples during training and learns stronger classifiers. However, just like Faster R-CNN, C-RPNs still confuses similar object categories, partly because the binary classifiers in the cascade RPN module only label samples as background or object, which offers limited help in distinguishing the category of an object.

Figure 7: Analysis of Top-Ranked False Positives. Pie charts: fraction of detections that are correct (Cor) or false positive due to poor localization (Loc), confusion with similar objects (Sim), confusion with other VOC objects (Oth), or confusion with background (BG). Bar graphs: absolute AP improvement by removing all false positives of one type. ‘L’: first bar segment is improvement if duplicate or poor localizations are removed; second bar is improvement if localization errors are corrected so that the false positives become true positives. ‘B’: no confusion with background and non-similar objects. ‘S’: no confusion with similar objects. The first and second columns: results of the baseline Faster R-CNN and C-RPNs on detecting animals. The third and fourth columns: results of the baseline Faster R-CNN and C-RPNs on detecting vehicles.
Figure 8: Analysis of different bird characteristics on the VOC2007 test set. Each plot shows Normalized AP (APN, Hoiem et al. (2012)) with standard error bars (red). Black dashed lines indicate overall APN. Key: Occlusion: N=none; L=low; M=medium; H=high. Truncation: N=not truncated; T=truncated. Bounding Box Area: XS=extra-small; S=small; M=medium; L=large; XL=extra-large. Aspect Ratio: XT=extra-tall/narrow; T=tall; M=medium; W=wide; XW=extra-wide. Viewpoint / Part Visibility: ‘1’=part/side is visible; ‘0’=part/side is not visible. Left: results of the baseline Faster R-CNN. Right: results of C-RPNs.

Figure 9: Summary of Sensitivity and Impact of Object Characteristics. The APN are shown over 7 categories of the highest performing and lowest performing subsets within each characteristic (occlusion, truncation, bounding box area, aspect ratio, viewpoint, part visibility). Overall APN is indicated by the dashed line. The difference between max and min indicates sensitivity. The difference between max and overall indicates the impact. Left: results of the baseline Faster R-CNN. Right: results of C-RPNs.

Figure 8 visualizes the analysis of different bird characteristics on the VOC2007 test set. The improvement is clear: C-RPNs achieves higher average precision across all characteristics (Occlusion, Truncation, Bounding Box Area, Aspect Ratio, Viewpoint and Part Visibility). Notably, when occlusion is High, C-RPNs can still recognize some birds while the baseline Faster R-CNN detects none. Furthermore, Hoiem et al. (2012) provide extra annotations for seven categories (‘airplane’, ‘bicycle’, ‘bird’, ‘boat’, ‘cat’, ‘chair’, ‘table’) for evaluating the robustness of detection approaches. Figure 9 provides a compact summary, over these seven categories, of the sensitivity to each characteristic and the potential impact of improving robustness. Overall, C-RPNs achieves higher normalized average precision than Faster R-CNN for all characteristics, indicating its robustness in various scenes. Moreover, the sensitivity to all of these characteristics decreases, which verifies that C-RPNs delivers an all-round improvement over Faster R-CNN. On the other hand, C-RPNs remains sensitive to bounding box size, just like Faster R-CNN, so there is still room for improvement.
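The two summary quantities in Figure 9 follow directly from the caption: sensitivity is the gap between the best and worst subset, and impact is the gap between the best subset and the overall APN. A minimal sketch is given below; the function name `sensitivity_and_impact` is hypothetical and the APN values are placeholders chosen for illustration, not numbers from the paper.

```python
def sensitivity_and_impact(apn_by_subset, overall_apn):
    """Summarize one characteristic (e.g. occlusion) as in Figure 9.

    apn_by_subset: mapping from subset name to its normalized AP.
    overall_apn:   normalized AP over all objects.
    """
    highest = max(apn_by_subset.values())
    lowest = min(apn_by_subset.values())
    return {
        "sensitivity": highest - lowest,   # max minus min
        "impact": highest - overall_apn,   # max minus overall
    }


# Placeholder values for illustration only (not results from the paper).
occlusion_apn = {"N": 0.5, "L": 0.25, "M": 0.125, "H": 0.0}
print(sensitivity_and_impact(occlusion_apn, overall_apn=0.25))
```

A detector that is more robust to a characteristic shows a smaller sensitivity value, which is the sense in which the decreased sensitivity of C-RPNs is read as an improvement.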

4.5 Ablation Studies

In previous sections, we have shown the effectiveness of C-RPNs on several datasets. To further evaluate the individual contribution of each component, we analyze how detection performance is affected by the cascade stages and by the feature chain and score chain. All experiments in this section use BSBDV 2017.

Effects of cascade stages

To study the effectiveness of C-RPNs with different numbers of cascade stages, we remove individual stages and summarize the results in Table 4. With only stage 3 and stage 4, C-RPNs achieves an AP of 69.5%, which already outperforms the baseline Faster R-CNN. Adding stage 2 raises the AP to 69.9%, and further adding stage 1 raises it to 70.3%, a gain of 0.4% at each step. The full 4-stage cascade thus achieves the best performance. These results indicate that employing more cascade stages and classifiers in C-RPNs benefits detection performance.

AP of C-RPNs (%)       69.5   69.9   70.3
C-RPNs with Stage 4    ✓      ✓      ✓
C-RPNs with Stage 3    ✓      ✓      ✓
C-RPNs with Stage 2    -      ✓      ✓
C-RPNs with Stage 1    -      -      ✓
Table 4: The impact of cascade stages (BSBDV 2017).
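To make the role of the stages concrete, the sketch below shows one way a cascade can discard easy samples stage by stage. The exact rejection rule of C-RPNs is defined in the method section and is not reproduced here; this sketch assumes that a sample is dropped once a stage assigns it a background probability above the reject threshold r, and the function name `cascade_filter` is hypothetical. For simplicity, scores for all stages are given up front, whereas a real cascade would only score the survivors of the previous stage.

```python
import numpy as np


def cascade_filter(stage_bg_probs, reject_threshold=0.99):
    """Keep only samples that no stage rejects as easy background.

    stage_bg_probs: array of shape (num_stages, num_samples); entry [s, i]
                    is the background probability stage s gives sample i.
    Returns (alive_mask, survivors_per_stage).
    Assumed rule: reject when background probability > reject_threshold.
    """
    probs = np.asarray(stage_bg_probs, dtype=float)
    alive = np.ones(probs.shape[1], dtype=bool)
    survivors_per_stage = []
    for stage_probs in probs:
        # A sample stays alive only if this stage does not reject it.
        alive &= stage_probs <= reject_threshold
        survivors_per_stage.append(int(alive.sum()))
    return alive, survivors_per_stage


# Two stages, four samples (made-up scores for illustration).
scores = [
    [0.999, 0.50, 0.20, 0.995],  # stage 1
    [0.100, 0.995, 0.30, 0.100],  # stage 2
]
mask, counts = cascade_filter(scores, reject_threshold=0.99)
print(mask.tolist(), counts)
```

Under this assumed rule, each additional stage can only shrink the set of surviving samples, so later classifiers are trained on a progressively harder subset.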

Effects of feature chain and score chain

To examine the impact of the feature chain and score chain more specifically, Table 5 shows the results of C-RPNs with and without each of them. We keep the same parameters as in previous sections and control the use of the feature chain and score chain separately. As shown in Table 5, the feature chain alone brings a 0.6% gain in AP. Adopting the score chain without the feature chain raises the AP by 0.4%, which shows that the score chain is also effective. Using both chains together boosts the AP by 0.9%. These results verify that using features and scores from previous stages as prior knowledge for later stages improves the final detection.

AP of C-RPNs (%)   69.4   70.0   69.8   70.3
Feature Chain      -      ✓      -      ✓
Score Chain        -      -      ✓      ✓
Table 5: The impact of feature/score chain (BSBDV 2017).
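The gains quoted for Table 5 can be checked with a few lines of arithmetic. The assignment of each AP value to a configuration is inferred from the gains stated in the text (0.6% for the feature chain alone, 0.4% for the score chain alone, 0.9% for both), so it should be read as an assumption about the table layout.

```python
# Keys are (uses_feature_chain, uses_score_chain); values are AP in %.
# The mapping is inferred from the gains reported in the text.
ap = {
    (False, False): 69.4,
    (True, False): 70.0,
    (False, True): 69.8,
    (True, True): 70.3,
}

baseline = ap[(False, False)]
# Round to one decimal to avoid floating-point noise in the differences.
gains = {config: round(value - baseline, 1) for config, value in ap.items()}

sum_of_individual = round(gains[(True, False)] + gains[(False, True)], 1)
combined = gains[(True, True)]

print(gains)
print(sum_of_individual, combined)
```

The combined gain is slightly smaller than the sum of the two individual gains, which suggests that the two chains are largely, though not perfectly, complementary.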

Selection of reject threshold and fusion rate

To find the best hyperparameters, we ran empirical tests with different values of the reject threshold r and the fusion rate on BSBDV 2017, using a one-dimensional grid search over each parameter in turn. Figure 10 shows the impact of these two factors. With the fusion rate fixed at 0.1, a reject threshold of r=0.99 achieved the best AP of 70.31%. We then fixed the reject threshold at 0.99 and ran a grid search over the fusion rate. As Figure 10 shows, the best fusion rate is 0.1, with an AP of 70.31%.

Figure 10: Grid search for the best reject threshold and fusion rate. Left: accuracy vs reject threshold r; Right: accuracy vs fusion rate.
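The search procedure described above (fix the fusion rate, sweep the reject threshold, then fix the best threshold and sweep the fusion rate) can be sketched as follows. The function names, the candidate grids, and the `toy_ap` scoring function are all hypothetical: `toy_ap` is a synthetic stand-in for training and evaluating a detector, with its peak placed at the reported optimum purely so that the example runs, and it is not data from the paper.

```python
def one_dim_grid_search(evaluate, r_grid, rate_grid, init_rate):
    """Search one hyperparameter at a time.

    evaluate(r, rate) -> AP. First sweep r with the fusion rate fixed at
    init_rate, then sweep the fusion rate with r fixed at its best value.
    """
    best_r = max(r_grid, key=lambda r: evaluate(r, init_rate))
    best_rate = max(rate_grid, key=lambda rate: evaluate(best_r, rate))
    return best_r, best_rate, evaluate(best_r, best_rate)


def toy_ap(r, rate):
    """Synthetic AP surface for illustration only (not measured results)."""
    return 70.31 - 50.0 * abs(r - 0.99) - 5.0 * abs(rate - 0.1)


# Hypothetical candidate grids.
r_candidates = [0.9, 0.95, 0.99, 0.999]
rate_candidates = [0.05, 0.1, 0.2, 0.5]

print(one_dim_grid_search(toy_ap, r_candidates, rate_candidates, init_rate=0.1))
```

This coordinate-wise search needs far fewer evaluations than a full two-dimensional grid, at the cost of possibly missing an optimum that only appears when both parameters change together.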

5 Conclusion

In this paper, we have presented C-RPNs, an effective approach for object detection in the real world. The essence of C-RPNs lies in adopting a cascade structure of region proposal networks to discard easy samples during training and learn stronger classifiers. Moreover, a feature chain and a score chain across multiple stages have been proposed to help generate more discriminative representations for proposals. Finally, a loss function over the cascade stages is designed to jointly learn the cascade classifiers. Extensive experiments have been conducted to evaluate C-RPNs on a common benchmark (Pascal VOC) and on several challenging datasets collected in wild scenes or realistic traffic scenes. C-RPNs achieves results competitive with the current state of the art and outperforms the baseline Faster R-CNN by a clear margin, demonstrating its efficacy for object detection in the real world.


References

  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. (2015) TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv: Distributed, Parallel, and Cluster Computing. Cited by: §4.1.
  • S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick (2016) Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2874–2883. Cited by: §2.1, §4.2, Table 1.
  • L. Bourdev and J. Brandt (2005) Robust object detection via soft cascade. In Computer Vision and Pattern Recognition, 2005. IEEE Computer Society Conference on, Vol. 2, pp. 236–243. Cited by: §2.2.
  • Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162. Cited by: §2.2.
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223. Cited by: §4.1.
  • J. Dai, Y. Li, K. He, and J. Sun (2016) R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387. Cited by: §2.1, §4.2, Table 2.
  • J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773. Cited by: §2.1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. IEEE Conference on, pp. 248–255. Cited by: §4.1.
  • P. Dollar, C. Wojek, B. Schiele, and P. Perona (2009) Pedestrian detection: a benchmark. In Computer Vision and Pattern Recognition, 2009. IEEE Conference on, pp. 304–311. Cited by: §1, §4.1, §4.1, §4.2.
  • P. Dollár, R. Appel, S. Belongie, and P. Perona (2014) Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (8), pp. 1532–1545. Cited by: §2.2.
  • A. Ess, B. Leibe, K. Schindler, and L. Van Gool (2008) A mobile vision system for robust multi-person tracking. In Computer Vision and Pattern Recognition, 2008. IEEE Conference on, pp. 1–8. Cited by: §4.1.
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §1, §1, §1, §2.1, §4.1.
  • P. F. Felzenszwalb, R. B. Girshick, and D. McAllester (2010) Cascade object detection with deformable part models. In Computer vision and Pattern Recognition, 2010 IEEE Conference on, pp. 2241–2248. Cited by: §2.2.
  • C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg (2017) DSSD: deconvolutional single shot detector. arXiv preprint arXiv:1701.06659. Cited by: §3.2.
  • S. Gidaris and N. Komodakis (2015) Object detection via a multi-region and semantic segmentation-aware cnn model. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1134–1142. Cited by: §2.1.
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587. Cited by: §2.1, §2.2.
  • R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international Conference on Computer Vision, pp. 1440–1448. Cited by: §2.1, §2.2, §3.3, §4.1, Table 1.
  • W. Guan, Y. Zou, and X. Zhou (2018) Multi-scale object detection with feature fusion and region objectness network. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2596–2600. Cited by: §1, §1, §4.1, §4.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pp. 346–361. Cited by: §2.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §2.1.
  • D. Hoiem, Y. Chodpathumwan, and Q. Dai (2012) Diagnosing error in object detectors. In European Conference on Computer Vision, pp. 340–353. Cited by: Figure 8, §4.4, §4.4.
  • T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen (2017) Ron: reverse connection with objectness prior networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 2. Cited by: §4.2, §4.2, Table 1, Table 2.
  • H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua (2015) A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5325–5334. Cited by: §2.2.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 4. Cited by: §2.1, §3.2, §4.2, Table 2.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2018) Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, pp. 740–755. Cited by: §1, §2.1.
  • S. Liu, D. Huang, et al. (2018a) Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 385–400. Cited by: §2.1.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37. Cited by: §2.1, Table 1, Table 2.
  • Y. Liu, R. Wang, S. Shan, and X. Chen (2018b) Structure inference net: object detection using scene-level context and instance-level relationships. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6985–6994. Cited by: §2.1, Table 1, Table 2.
  • I. Loshchilov and F. Hutter (2015) Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343. Cited by: §2.2.
  • W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C. Loy, et al. (2015) Deepid-net: deformable deep convolutional neural networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2403–2412. Cited by: §2.2.
  • C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun (2018) MegDet: a large mini-batch object detector. Computer Vision and Pattern Recognition. Cited by: §2.1.
  • H. Qin, J. Yan, X. Li, and X. Hu (2016) Joint training of cascaded cnn for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3456–3465. Cited by: §2.2.
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788. Cited by: §2.1.
  • S. Ren, K. He, R. B. Girshick, and J. Sun (2017) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. Cited by: §1, §2.1, §2.2, Figure 4, §4.1, Table 1, Table 2, Table 3.
  • A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769. Cited by: §2.2.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. In Computer Science. Cited by: §2.1, §3.1.
  • E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, and F. Moreno-Noguer (2015) Fracking deep convolutional image descriptors. In Computer Science. Cited by: §2.2.
  • K. Sung (1996) Learning and example selection for object and pattern detection. Cited by: §2.2.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §2.1.
  • C. Wojek, S. Walk, and B. Schiele (2009) Multi-cue onboard pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 794–801. Cited by: §4.1.
  • R. Xiao, L. Zhu, and H. Zhang (2003) Boosting chain learning for object detection. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pp. 709–715. Cited by: §2.2.
  • B. Yang, J. Yan, Z. Lei, and S. Z. Li (2016) Craft objects from images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6043–6051. Cited by: §2.2.
  • D. Yang, J. Zhang, S. Xu, S. Ge, G. H. Kumar, and X. Zhang (2018) Real-time pedestrian detection via hierarchical convolutional feature. Multimedia Tools and Applications 77 (19), pp. 25841–25860. Cited by: §4.1.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In British Machine Vision Conference. Cited by: §2.1.
  • L. Zhang, L. Lin, X. Liang, and K. He (2016) Is faster r-cnn doing well for pedestrian detection?. In European Conference on Computer Vision, pp. 443–457. Cited by: §2.1.
  • S. Zhang, R. Benenson, and B. Schiele (2017) Citypersons: a diverse dataset for pedestrian detection. In The IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 3. Cited by: §1, §4.1, §4.1, §4.2.