Cascade Region Proposal and Global Context for Deep Object Detection

10/30/2017 ∙ by Qiaoyong Zhong, et al. ∙ Hikvision

A deep region-based object detector consists of a region proposal step and a deep object recognition step. In this paper, we make significant improvements on both steps. For region proposal, we propose a novel lightweight cascade structure which effectively improves RPN proposal quality. For object recognition, we re-implement global context modeling with a few modifications and obtain a performance boost (4.2 points mAP gain on the ImageNet DET validation set). Besides, we apply the idea of pre-training extensively and show its importance in both steps. Together with common training and testing tricks, we improve the Faster R-CNN baseline by a large margin. In particular, we obtain 87.9% mAP on the PASCAL VOC 2012 test set and 36.8% AP on the COCO test-std set.





1 Introduction

Object detection is a fundamental problem in computer vision. It has been widely studied for many years, and the state-of-the-art approaches are based on convolutional neural networks (CNN) sermanet2013overfeat ; simonyan2014very ; erhan2014scalable ; szegedy2014scalable ; redmon2015you ; liu2015ssd . Girshick et al. girshick2014rich proposed region-based CNN (R-CNN), which successfully transfers the image-level recognition power of CNN to object detection. R-CNN was subsequently developed and accelerated in SPP-Net he2015spatial , NoC ren2015object , Fast R-CNN girshick2015fast and Faster R-CNN ren2015faster .

In Faster R-CNN, a carefully designed region proposal network (RPN) was introduced to extract high-quality region proposals. The extracted proposals are then fed into Fast R-CNN (FRCN) for object recognition. In this paper, we make extensive improvements on both region proposal quality (RPN side) and object recognition accuracy (FRCN side). Our contribution is three-fold. 1) We propose a lightweight cascade RPN architecture, which can extract accurately localized region proposals with marginal extra computational cost. 2) We revisit global context modeling. With a few modifications in network architecture and the idea of pretraining, we obtain significant performance gain. 3) We systematically evaluate common training and testing tricks that can be found in the literature and report their contributions to detection performance.

Based on these improvements over Faster R-CNN baseline, we achieve the state-of-the-art performance on PASCAL VOC 2012 everingham2010pascal , ILSVRC 2016 russakovsky2015imagenet and COCO lin2014microsoft .

2 Related Work

Cascade Region Proposal

Conventional region proposal methods are normally based on low-level features, either unsupervised (e.g. Selective Search uijlings2013selective and EdgeBoxes zitnick2014edge ) or supervised (e.g. BING cheng2014bing ). With the success of CNN in computer vision krizhevsky2012imagenet , high-level semantic CNN features have been adopted for region proposal. Taking RPN as an example, a fully convolutional architecture is designed to extract high-level features and predict proposals in an end-to-end way. On the other hand, refining region proposals with a multi-stage cascading pipeline has also been explored. DeepBox Kuo2015DeepBox uses a CNN to rerank proposals from a conventional method, which may be considered a special form of cascading. CRAFT yang2016craft uses a two-class Fast R-CNN to refine proposals from a standard RPN, which was further developed in zeng2016crafting . DeepProposal ghodrati2015deepproposal proposes an inverse cascade to exploit feature maps of different levels. Although our design shares a similar pipeline with CRAFT yang2016craft , we use a modified RPN instead of Fast R-CNN in the second cascading stage. The details are described in Section 3.2.2.

Context Modeling

Objects in the real world do not exist on their own. They are surrounded by a background (e.g. sky, grassland) and are likely to coexist with other objects, either of the same category or not. This context may provide valuable information for discriminating objects of interest. In R-CNN, classification of an object is based solely on regional information. In SPP-Net and Fast R-CNN, with the enlarged receptive fields of convolution layers, contextual information is implicitly yet only partially exploited, as regional features are cropped from feature maps of deep layers. Gidaris and Komodakis gidaris2015object proposed a Multi-Region CNN architecture, where a rectangular ring around the object is cropped and serves as context. chen20153d ; cai2016unified utilized an enlarged area (e.g. 1.5×) around the object as local context. Bell et al. bell16ion used spatial recurrent neural networks (RNN) to model context outside the object. Ouyang et al. used ImageNet 1000-class whole-image classification scores as scene information to refine per-object classification scores. He et al. he2016deep proposed global context modeling, where image-level features are extracted through RoI pooling over the whole image. In this paper, we re-implement the global context modeling of he2016deep with a few modifications.

3 Methods

First, we introduce the backbone network we use. Then our improvements on both region proposal and classification are described, followed by the common training and testing tricks we adopt and our configurations.

3.1 Backbone Network

Following he2016deep , we choose ResNet-101 as the backbone network. Since its introduction, ResNet has been further developed. Gross and Wilber gross2016training reimplemented ResNet with a few modifications, e.g. performing down-sampling in the 3×3 convolution of the bottleneck block instead of the first 1×1 convolution. Afterwards, He et al. he2016identity revisited the residual connection and proposed the identity mapping variant. In this paper, we adopt both the 3×3 convolution down-sampling and identity mapping in the ResNet-101 backbone network. The backbone network is pretrained on the ImageNet classification dataset and then fine-tuned for detection tasks.

3.2 Improving Region Proposal

It is obvious, and has also been reported hosang2016makes , that proposal quality is crucial for region-based detectors. CNN-based RPN can extract more compact, higher-quality proposals than conventional methods. Based on RPN, we propose several strategies to further improve proposal quality, including pretraining RPN, cascade RPN and a constrained ratio of negative to positive anchors.

3.2.1 Pretraining RPN

Pretraining on a large-scale dataset and transferring the learned features to another task with a smaller dataset has become the de facto strategy in deep learning applications. It has been widely reported that fine-tuning from a pretrained model can combat overfitting effectively simonyan2014very , leading to superior performance over training from scratch. In this paper, we apply this idea extensively. The basic principle is to pretrain as many layers as possible.

The RPN sub-network is connected to an intermediate layer, e.g. the last layer of the conv4_x block for ResNets. Normally an extra convolution layer is added, which acts as a buffer to prevent direct back-propagation of gradients from the RPN branch cai2016unified . In the Faster R-CNN baseline, this layer is randomly initialized and trained from scratch during fine-tuning. We instead pretrain this layer using an auxiliary classifier with a loss weight of 0.3 as in szegedy2015going . Figure 1 shows the architecture. The preceding BN-ReLU layers are inserted to comply with the identity mapping design. When fine-tuning RPN, the linear classifier as well as the global average pooling (GAP) layer are replaced with RPN's classification and bounding box regression layers.

Figure 1: Pretraining RPN with an auxiliary classifier. The trunk classifier on the left is for pretraining R-CNN, while the auxiliary classifier on the right is for pretraining RPN. GAP denotes global average pooling.

3.2.2 Cascade RPN

We propose a lightweight cascade architecture to refine the score and location of RPN proposals. Figure 2 shows the pipeline. The upper part is a standard RPN (RPN 1), which utilizes sliding window proposals as anchors and produces bounding box regressed proposals. Besides RPN 1, we add another RPN sub-network (RPN 2), which takes the output proposals of RPN 1 as input. Note that the input proposals of RPN 2 refer to those right after bounding box regression, without any post-processing (e.g. sorting, non-maximum suppression and truncating in number). Thus there is a one-to-one correspondence between sliding window anchors and input proposals of RPN 2. When training RPN 2, the sliding window anchors are used to locate the pixel position in the feature map, while their corresponding proposals are used to compute classification and bounding box regression targets.

During inference, we find that RPN 2 improves recall of medium and large objects while hurting recall of small objects. We therefore combine small bounding boxes from RPN 1 and medium-to-large bounding boxes from RPN 2 as the final set of proposals. In our experiments, we set the size threshold to in pixels.
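As a sketch of this size-based merging, the snippet below (our own illustration; the helper names are hypothetical, and the threshold value is a placeholder since the exact pixel threshold is not given above) combines small boxes from RPN 1 with medium-to-large boxes from RPN 2:

```python
def merge_cascade_proposals(boxes_rpn1, boxes_rpn2, size_thresh=96.0):
    """Combine small boxes from RPN 1 with medium/large boxes from RPN 2.

    Boxes are (x1, y1, x2, y2) tuples; object size is taken here as the
    geometric mean of width and height. `size_thresh` (in pixels) is a
    placeholder value, not the paper's actual setting.
    """
    def size(box):
        x1, y1, x2, y2 = box
        return ((x2 - x1) * (y2 - y1)) ** 0.5

    small = [b for b in boxes_rpn1 if size(b) < size_thresh]
    medium_large = [b for b in boxes_rpn2 if size(b) >= size_thresh]
    return small + medium_large
```

A small box survives from RPN 1 even if RPN 2 regressed it poorly, while larger boxes benefit from RPN 2's refinement.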

Compared with other cascade region proposal methods yang2016craft , our approach has the following advantages.

Easy to implement

The logic of RPN 2 is very similar to that of a standard RPN. We only need to replace the sliding window anchors with proposals from RPN 1. The network architecture is compact and can be easily configured with common deep learning frameworks like Caffe jia2014caffe .

Computationally efficient

yang2016craft uses an extra two-class Fast R-CNN to refine RPN proposals, which is computationally inefficient. Our method instead stacks two RPN sub-networks sequentially. The fully convolutional nature makes it very efficient in terms of both computation and memory.

Figure 2: Cascade RPN pipeline by stacking two RPN sub-networks sequentially. RPN 2 is adapted so as to take output proposals of RPN 1 as reference.

3.2.3 Constrained N/P Anchor Ratio

Another improvement we make to RPN is to control (normally limit) the ratio of negative to positive (N/P) anchors during the training phase. In naïve RPN, a common choice for the anchor batch size is 256 and the expected N/P ratio is 1. In practice, however, we find that this ratio may be very large, usually greater than 10. The imbalance may lead to a bias towards the background class, thus hurting proposal recall. To address this issue, we add two hyper-parameters for training RPN, i.e. max_np_ratio and min_batch_size. max_np_ratio works by shrinking the batch size when there are too few positive anchors, and min_batch_size makes sure that the effective batch size does not become too small. In our experiments, we arbitrarily set max_np_ratio to 1.5 and min_batch_size to 32.
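A minimal sketch of how such constrained sampling could work (our own illustration, not the paper's actual implementation; `sample_anchors` and its exact tie-breaking are assumptions):

```python
import random

def sample_anchors(pos_idx, neg_idx, batch_size=256,
                   max_np_ratio=1.5, min_batch_size=32):
    """Sample anchor indices with a bounded negative/positive (N/P) ratio.

    If negatives would exceed max_np_ratio * positives, the batch shrinks,
    but min_batch_size keeps the effective batch from becoming too small.
    """
    n_pos = min(len(pos_idx), batch_size // 2)
    pos = random.sample(pos_idx, n_pos)
    # Cap negatives by the N/P ratio ...
    n_neg = min(len(neg_idx), batch_size - n_pos, int(max_np_ratio * n_pos))
    # ... then enforce the minimum effective batch size.
    n_neg = max(n_neg, min(len(neg_idx), min_batch_size - n_pos))
    neg = random.sample(neg_idx, n_neg)
    return pos, neg
```

With few positives, min_batch_size dominates and the ratio may still exceed the cap slightly; with many positives, the cap of 1.5 negatives per positive applies.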

3.3 Augmenting Classification with Pretrained Global Context

Our implementation of global context is based on the work by He et al. he2016deep with a few modifications. Figure 3 demonstrates our design. Besides the RoI pooling over each RoI region, an RoI pooling over the entire image is performed to extract global features. Since bounding box regression is based on the relative position and scale shift of target objects over proposals, the surrounding context information may not help at all, or may even confuse the regressor. Thus, unlike he2016deep , we use global context features for classification only, not for bounding box regression.

Furthermore, we apply the pretraining principle (section 3.2.1) again on global context. For extremely deep networks, the number of newly-added layers in the global context branch may be quite large, e.g. 9 convolution layers for ResNet-101. When fine-tuning a detection network, we normally choose a relatively low base learning rate (e.g. 0.001). It would be difficult to train 9 layers from scratch with such a low learning rate, so pretraining is critical in this scenario. In our experiments, we simply copy the pretrained parameters of conv5_x to the global context branch for initialization.

Figure 3: Global context modeling through RoI pooling over the entire image. The combination of per-RoI features and global features are used to classify the objects of interest.

3.4 Training Tricks

3.4.1 Balanced Sampling

Data imbalance is a commonly occurring problem in visual recognition tasks. For example, the ImageNet detection (DET) training set is highly unbalanced (Figure 4). The top-3 most frequent classes (i.e. dog, person and bird) comprise 36% of all object instances, and the largest class can contain over 100 times more instances than the smallest. The conventional sampling strategy is to construct a universal sample list over all classes and read samples consecutively from the list, which causes a bias towards large classes.

Shen et al. shen2016relay proposed a class-aware sampling strategy to cope with the imbalance issue in the scene classification task. We adapt this strategy for the object detection task. During training, we first sample a class, then sample an image containing objects of that class.
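The two-step sampling can be sketched as follows (an illustrative implementation; the function and variable names are our own):

```python
import random

def class_aware_sample(class_to_images, num_samples, seed=0):
    """Draw training images uniformly over classes rather than over images.

    class_to_images maps a class name to the list of images containing it;
    a single image may appear under several classes.
    """
    rng = random.Random(seed)
    classes = sorted(class_to_images)
    samples = []
    for _ in range(num_samples):
        cls = rng.choice(classes)                         # 1) pick a class uniformly
        samples.append(rng.choice(class_to_images[cls]))  # 2) pick one of its images
    return samples
```

Because the class is drawn first, a class with 100× more images is no longer seen 100× more often.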

Figure 4: Highly unbalanced distribution of the 200-class objects in the ImageNet DET training set. The top-3 frequent objects are dog, person and bird.

3.4.2 Negative Samples Training

In the original implementation of Faster R-CNN, background images without any object instance are discarded. In our opinion, these negative samples are valuable in the sense that they enrich the background class and help reduce false alarms. In our implementation, negative samples are utilized for training the classifier only; bounding box regression is not involved, as there is no object instance available.

3.4.3 Multi-Scale Training

For data augmentation, besides random horizontal flipping, we also adopt multi-scale training. During training, the short side of the image is resized to one of four scales {400, 500, 600, 700} randomly.

3.4.4 Online Hard Example Mining

We follow the online hard example mining (OHEM) paper shrivastava2016training for training the FRCN branch. 300 proposals are fed through forward propagation, while the 64 most difficult proposals, measured by loss value, are selected for backward propagation. We adopt the implementation by Dai et al. dai2016r .
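The hard-example selection step itself reduces to picking the proposals with the largest losses, which can be sketched as (illustrative only, not the dai2016r implementation):

```python
def select_hard_examples(losses, num_hard=64):
    """Return indices of the proposals with the largest loss values.

    In OHEM, all proposals (300 here) run through the forward pass; only
    the `num_hard` hardest ones contribute to the backward pass.
    """
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:num_hard]
```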

3.5 Testing Tricks

During inference, common tricks like multi-scale testing, horizontal flipping and weighted box voting are applied. Based on validation, we choose the following hyper-parameter settings.

  • Multi-scale testing is used for both RPN and FRCN, i.e. three scales ({400, 600, 800}) for RPN, and five scales ({200, 400, 600, 800, 1000}) for FRCN.

  • Horizontal flipping is also used for both RPN and FRCN.

  • Weighted box voting is used only for FRCN, as we find that it hurts RPN proposal recall at high IOU thresholds (e.g. 0.7).

When doing multi-scale testing and horizontal flipping, we use different merging strategies for RPN and FRCN. For RPN, all sets of proposals from various scale and flip settings are combined and their union set is post-processed through non-maximum suppression (NMS). For FRCN, scores and refined bounding boxes of various scale and flip settings are averaged.
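For the RPN side, the union-then-NMS merging step relies on plain greedy non-maximum suppression, which can be sketched as follows (an illustrative Python implementation, not the paper's C++ code):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS over the union of per-scale, per-flip proposals.

    Returns the indices of kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```

The IOU threshold is the dataset-dependent knob discussed below; lowering it suppresses more aggressively.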

Furthermore, IOU thresholds of NMS are tuned for different datasets based on performance on validation set. Specifically, we use 0.4 on ImageNet and 0.45 on PASCAL VOC and COCO instead of the commonly used 0.3.

3.6 Implementation Details

Based on py-faster-rcnn, we reimplement Faster R-CNN in C++ under the Caffe framework jia2014caffe , which allows us to perform efficient multi-GPU training. We adopt joint training of RPN and FRCN instead of the 4-step training scheme for convenience. In all experiments, we use the following settings.

  • The RPN anchor scales are extended to 6 scales ({32, 64, 96, 128, 256, 512}).

  • Per-GPU RPN batch size is unchanged (256), while FRCN batch size is reduced from 128 to 64.

  • Models are trained using 4 GPUs. Thus the effective batch size is 4.

  • The conventional SGD with 0.9 momentum, 0.0001 weight decay and the step learning rate policy with 0.001 base learning rate are used for optimization.

4 Results and Discussion

To evaluate performance of our approach, as well as common training and testing tricks, we perform extensive ablation studies on ImageNet detection (DET) and localization (LOC) datasets. Besides, results on PASCAL VOC 2012 and COCO are also reported.

4.1 ImageNet DET

Following girshick2014rich , we split the validation set into two parts (val1 and val2). train+val1 set is used for training, and val2 set is used for parameter tuning and model selection. The total number of training samples is for positive samples only and including the extra negative training data.

In each experiment, we train the model for 200k iterations with the learning rate of 0.001, and then for extra 100k iterations with 0.0001.

4.1.1 Faster R-CNN Baseline

We pretrain the ResNet-101 backbone network on the ImageNet classification dataset. The top-1 accuracy on the validation set is 76.3%, which is on par with he2016deep . With fine-tuning on the ImageNet detection dataset, we obtain a baseline Faster R-CNN model, whose mAP is 52.6% on val2.

4.1.2 Improved Region Proposal

Based on the Faster R-CNN baseline, ablation studies are performed on the region proposal improvements. Table 1 shows the results. Compared with the baseline, pretrained RPN improves recall@0.5 by 1.2 points and mAP@0.5 by 0.3 points, while no improvement in recall at high IOU thresholds is obtained. Cascade RPN improves recall@0.7 by 8.8 points and average recall (AR) by 4.8 points, leading to a 0.8% mAP@0.7 gain. Constrained N/P further increases recall by up to 3 points. Combining all three strategies, we obtain a 2% recall@0.5 gain, an 11.2% recall@0.7 gain and a 6.9% AR gain. In terms of detection performance, we obtain a 0.5% mAP gain at the normal IOU threshold.

From the results we can conclude that our approach significantly improves the localization accuracy of region proposals. However, the mAP gain is relatively limited. After the competition, we had more time to look into this issue. The reason might be that the extra RPN branch of our cascade RPN architecture introduces too strong an influence from region proposal errors during back-propagation. In other words, as RPN and FRCN are jointly trained within one network, FRCN performance declines as RPN gets augmented. To validate this hypothesis, we design two more experiments. In the first, we reduce the loss weight of RPN by half, i.e. from 1 to 0.5 (LW=0.5). In the second, we replace joint training with a 2-step training scheme, in which RPN and FRCN are optimized independently. In this case, the possible competition between RPN and FRCN is fully eliminated.

The results are displayed in Table 1. After reducing the loss weight of RPN, we obtain an extra 0.9% and 1.2% mAP gain at the 0.5 and 0.7 IOU thresholds respectively, even though proposal recall declines slightly. When using the 2-step training scheme, proposal recall improves, while detection mAP remains on par with the reduced loss weight setting. From these experiments we may conclude that while joint training of RPN and FRCN can achieve performance comparable to separate training, balancing the loss weights of RPN and FRCN is very important. With this issue addressed, in total we obtain a 1.4% mAP gain at 0.5 IOU and 1.9% at 0.7 IOU from the region proposal improvements.

On an M40 graphics card, the average running time over 1000 samples to extract 300 proposals is 87ms for the RPN baseline and 98ms for our cascade RPN. The extra computational cost (11ms) is marginal, considering the significant improvement on proposal quality and detection performance.

              Baseline  Pretrained RPN  Cascade RPN  All three  All + LW=0.5  All + 2-step training
Recall@0.5    88.5      89.7            88.5         90.5       90.0          90.7
Recall@0.7    67.1      66.5            75.3         78.3       77.4          78.7
AR@[0.5:0.95] 49.9      50.0            54.8         56.8       56.0          58.0
mAP@0.5       52.6      52.9            52.9         53.1       54.0          54.0
mAP@0.7       39.1      39.3            40.1         39.8       41.0          40.9
Table 1: Region proposal recalls and detection results on the ImageNet DET val2 set. "All three" combines pretrained RPN, cascade RPN and constrained N/P; the last two columns additionally use the reduced RPN loss weight (LW=0.5) or the 2-step training scheme. All results are based on 300 proposals. AR denotes average recall over IOU thresholds from 0.5 to 0.95 with a step of 0.05.

4.1.3 Pretrained Global Context

Based on the current best model (without the post-competition improvement) with improved region proposals, we design two experiments on global context. In the first experiment, the global context branch in Figure 3 is randomly initialized with the xavier policy glorot2010understanding and trained from scratch, while in the second it is fine-tuned from pretrained parameters. The results are shown in Table 2. With pretraining, global context works surprisingly well, with a 4.2-point mAP gain, while nearly no improvement is obtained with random initialization. he2016deep reported a 1-point mAP gain with global context on the COCO dataset lin2014microsoft . Although results are not directly comparable across datasets, our implementation is proven to be very effective.

Baseline Random Pretrained
mAP 53.1 53.2 57.3
Table 2: Detection results of global context modeling on the ImageNet DET val2 set. Baseline: no global context. Random: randomly initialized global context. Pretrained: pretrained global context.

4.1.4 Training and Testing Tricks

Taking the network with pretrained global context as the new baseline, we report further performance improvements from common training and testing tricks in Table 3. Balanced sampling leads to a 2.5-point mAP gain, which is reasonable considering the severe data imbalance. With negative samples added for training, mAP increases by 0.8 points, while no improvement is observed from multi-scale training. With OHEM, an extra 1.2-point mAP gain is obtained, leading to 61.7% mAP without testing tricks that would increase running time dramatically. With multi-scale testing, horizontal flipping and weighted box voting, mAP reaches 64.1%. With pretraining on the ImageNet localization (LOC) dataset, an extra 0.8-point mAP gain is obtained. Finally, by combining the best region proposals and object classifier, mAP reaches 65.1%, which is our best single-model result.

Configuration (cumulative)       mAP
Baseline                         57.3
Train: + Balanced sampling       59.8
Train: + Negative samples        60.6
Train: + Multi-scale training    60.5
Train: + OHEM                    61.7
Test:  + Multi-scale testing     63.3
Test:  + Horizontal flipping     63.9
Test:  + Box voting              64.1
+ Pretrain on LOC                64.9
+ Comb. best RPN & FRCN          65.1
Table 3: Detection results of common training and testing tricks on the ImageNet DET val2 set. Tricks are added cumulatively from top to bottom.

4.1.5 Model Ensemble

With model ensemble, we obtain 67.0% mAP on the ImageNet DET val2 set and 65.3% mAP on the test set (Table 4), which ranks 2nd in the ILSVRC 2016 challenge. In terms of single-model results, we obtain 63.4% mAP on the test set, surpassing CUImage zeng2016crafting by a slight margin. Compared with the MSRA baseline he2016deep , we improve mAP by 4.6 points for the single model and 3.2 points for the ensemble.

                                                   val2   test
Single model  MSRA he2016deep (ILSVRC'15)          60.5   58.8
              Ours (ILSVRC'16)                     65.1   63.4
              CUImage zeng2016crafting (ILSVRC'16) 65.0   63.36
Ensemble      MSRA he2016deep (ILSVRC'15)          63.6   62.1
              CUImage zeng2016crafting (ILSVRC'16) 68.0   66.3
              Ours (ILSVRC'16)                     67.0   65.3
Table 4: Final detection results on the ImageNet DET val2 and test sets.

4.2 ImageNet LOC — Localization by Detection

The object localization task (LOC) of ILSVRC is more challenging, as the number of classes (1000) is much larger than in DET (200). He et al. he2016deep proposed a per-class RPN + R-CNN pipeline for object localization. In contrast, we simply apply our object detection approach to object localization. Since localization performance is limited by classification performance, for a fair comparison we use a classification model with an error rate equivalent to that in he2016deep . As shown in Table 5, given the same classification error rate (4.6%), our localization error rate on the validation set is 10.2%, surpassing the MSRA baseline he2016deep by 0.4 points. With model ensemble, we further reduce the localization error rate to 8.8%. On the test set, even with relatively inferior classification performance (3.7% versus 3.6% top-5 error rate), our localization error rate reaches 8.7%, surpassing the MSRA baseline he2016deep by 0.3 points. Our submission won the 2nd place in the object localization task of ILSVRC 2016.

                                val LOC err.  val CLS err.  test LOC err.  test CLS err.
Single model  MSRA he2016deep   10.6          4.6           -              -
              Ours              10.2          4.6           -              -
Ensemble      MSRA he2016deep   8.9           -             9.0            3.6
              Ours              8.8           3.5           8.7            3.7
Table 5: Object localization results (top-5 error) on the ImageNet LOC validation and test sets.

4.3 PASCAL VOC

Besides ImageNet, we also evaluate our approach on PASCAL VOC 2012. For VOC data, we use the trainval and test sets of VOC 2007 as well as the train set of VOC 2012 (07++12) for training, and the val set of VOC 2012 for validation. The COCO dataset is used as extra training data (COCO+07++12). Since COCO has 80 categories, we only use the 20 categories that are present in PASCAL VOC. After training on COCO+07++12 converges, we further fine-tune the model on VOC 07++12.

Table 6 shows the detection results on both the val and test sets of PASCAL VOC 2012. Using standard PASCAL VOC training data, our mAP is 82.9% on the val set and 81.8% on the test set. With extra COCO data, mAP is boosted to 86.7% and 86.0% respectively, surpassing the MSRA baseline he2016deep by 2.2 points. This performance is still competitive with more recent methods like Deformable R-FCN dai17dcn and ResNeXt-101 xie2017 , considering the fact that they use the val set of VOC 2012 as extra training data while we do not. With further pretraining on ImageNet DET, our mAP on the test set reaches 87.9%, which is the state-of-the-art single-model performance at the time of writing.

Method                     Training data    val mAP  test mAP
MSRA he2016deep            COCO+07++12      -        83.8
Deformable R-FCN dai17dcn  COCO+07++12      -        87.1
ResNeXt-101 xie2017        COCO+07++12      -        86.1
R-FCN dai2016r             COCO+07++12      -        85.0
Ours                       07++12           82.9     81.8
Ours                       COCO+07++12      86.7     86.0
Ours                       DET+COCO+07++12  88.7     87.9
Table 6: Single-model detection results on PASCAL VOC 2012. Note that unlike other submissions, the val set of VOC 2012 is not used for training in any of our experiments.

4.4 COCO

We further evaluate our approach on the more challenging COCO dataset. Table 7 compares the current state-of-the-art methods in terms of single-model performance on the test-dev and test-std sets. On test-dev, we achieve 36.7% AP, surpassing the MSRA baseline he2016deep by 1.8 points. In particular, our approach is superior to the very recent FPN lin2016fpn , which exploits feature maps of multiple scales. Deformable R-FCN dai17dcn is the current leading method; however, it uses Inception-ResNet, a much stronger backbone network than the ResNet-101 we use. At the 0.5 IOU threshold, we get 60.3% and 60.5% AP on test-dev and test-std respectively, significantly surpassing all other approaches. This is due to our improvements (e.g. pretrained global context) to FRCN object classification accuracy. Figure 5 shows some examples of our detection results. Our approach is robust against variations in object size and aspect ratio.

Method                     Backbone          test-dev AP / AP@.5  test-std AP / AP@.5
Deformable R-FCN dai17dcn  Inception-ResNet  37.5 / 58.0          37.3 / 58.0
Ours                       ResNet-101        36.7 / 60.3          36.8 / 60.5
FPN lin2016fpn             ResNet-101        36.2 / 59.1          35.8 / 58.5
MSRA he2016deep            ResNet-101        34.9 / 55.7          - / -
G-RMI Huang_2017_CVPR      Inception-ResNet  34.7 / -             - / -
Table 7: Single-model detection results of the current state-of-the-art methods on the COCO test-dev and test-std sets.
Figure 5: Examples of our detection results on the COCO test-dev set. Best viewed in color.

5 Conclusions

This paper proposes several improvements on both region proposal and region classification, in particular the novel cascade RPN architecture and improved global context modeling. Thanks to their computational efficiency, these improvements are applicable in practical scenarios (e.g. embedded platforms) where computational resources are limited. Common training and testing tricks are adopted and systematically evaluated in the context of large-scale object detection, which may ease choosing appropriate tricks given a fixed computational budget. Our approach surpasses the baseline by a significant margin and achieves state-of-the-art performance on PASCAL VOC, ILSVRC 2016 and COCO.



  • (1) P. F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, TPAMI 32 (9) (2010) 1627–1645.
  • (2) P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, Overfeat: Integrated recognition, localization and detection using convolutional networks, in: ICLR, 2014.
  • (3) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: ICLR, 2015.
  • (4) D. Erhan, C. Szegedy, A. Toshev, D. Anguelov, Scalable object detection using deep neural networks, in: CVPR, 2014.
  • (5) C. Szegedy, S. Reed, D. Erhan, D. Anguelov, Scalable, high-quality object detection, arXiv preprint arXiv:1412.1441.
  • (6) J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: CVPR, 2016.
  • (7) W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, Ssd: Single shot multibox detector, in: ECCV, 2016.
  • (8) R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: CVPR, 2014.
  • (9) K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, TPAMI 37 (9) (2015) 1904–1916.
  • (10) S. Ren, K. He, R. Girshick, X. Zhang, J. Sun, Object detection networks on convolutional feature maps, arXiv preprint arXiv:1504.06066.
  • (11) R. Girshick, Fast r-cnn, in: ICCV, 2015.
  • (12) S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in: NIPS, 2015.
  • (13) M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, IJCV 88 (2) (2010) 303–338.
  • (14) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, IJCV 115 (3) (2015) 211–252.
  • (15) T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: ECCV, Springer, 2014, pp. 740–755.
  • (16) J. R. Uijlings, K. E. van de Sande, T. Gevers, A. W. Smeulders, Selective search for object recognition, IJCV 104 (2) (2013) 154–171.
  • (17) C. L. Zitnick, P. Dollár, Edge boxes: Locating object proposals from edges, in: ECCV, 2014.
  • (18) M.-M. Cheng, Z. Zhang, W.-Y. Lin, P. Torr, Bing: Binarized normed gradients for objectness estimation at 300fps, in: CVPR, 2014.
  • (19) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: NIPS, 2012.
  • (20) W. Kuo, B. Hariharan, J. Malik, Deepbox: Learning objectness with convolutional networks, in: ICCV, 2015, pp. 2479–2487.
  • (21) B. Yang, J. Yan, Z. Lei, S. Z. Li, Craft objects from images, in: CVPR, 2016.
  • (22) X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou, B. Yang, Z. Wang, et al., Crafting gbd-net for object detection, arXiv preprint arXiv:1610.02579.
  • (23) A. Ghodrati, A. Diba, M. Pedersoli, T. Tuytelaars, L. Van Gool, Deepproposal: Hunting objects by cascading deep convolutional layers, in: ICCV, 2015.
  • (24) S. Gidaris, N. Komodakis, Object detection via a multi-region and semantic segmentation-aware cnn model, in: ICCV, 2015.
  • (25) X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, R. Urtasun, 3d object proposals for accurate object class detection, in: NIPS, 2015.
  • (26) Z. Cai, Q. Fan, R. S. Feris, N. Vasconcelos, A unified multi-scale deep convolutional neural network for fast object detection, in: ECCV, 2016.
  • (27) S. Bell, C. L. Zitnick, K. Bala, R. Girshick, Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks, in: CVPR, 2016.
  • (28) W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C.-C. Loy, et al., Deepid-net: Deformable deep convolutional neural networks for object detection, in: CVPR, 2015.
  • (29) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016.
  • (30) S. Gross, M. Wilber, Training and investigating residual nets, accessed: Jul 20, 2017 (2016).
  • (31) K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, in: ECCV, 2016.
  • (32) J. Hosang, R. Benenson, P. Dollár, B. Schiele, What makes for effective detection proposals?, TPAMI 38 (4) (2016) 814–830.
  • (33) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: CVPR, 2015.
  • (34) Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: ACMMM, ACM, 2014, pp. 675–678.
  • (35) L. Shen, Z. Lin, Q. Huang, Relay backpropagation for effective learning of deep convolutional neural networks, in: ECCV, 2016.
  • (36) A. Shrivastava, A. Gupta, R. Girshick, Training region-based object detectors with online hard example mining, in: CVPR, 2016.
  • (37) J. Dai, Y. Li, K. He, J. Sun, R-fcn: Object detection via region-based fully convolutional networks, in: NIPS, 2016.
  • (38) X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks., in: AISTATS, Vol. 9, 2010, pp. 249–256.
  • (39) J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, Y. Wei, Deformable convolutional networks, arXiv preprint arXiv:1703.06211.
  • (40) S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, in: CVPR, 2017.
  • (41) T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: CVPR, 2017.
  • (42) J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, K. Murphy, Speed/accuracy trade-offs for modern convolutional object detectors, in: CVPR, 2017.