Object detection is a fundamental problem in computer vision research. In recent years, remarkable progress has been made in object detection [16, 7, 26, 21], arguably owing to the rapid development of deep neural network based methods [13, 23, 11, 10, 19, 4]. Among them, one of the most influential methods is the R-CNN framework, which performs CNN-based classification on object proposals produced by various proposal generation methods. As two examples of improvement upon R-CNN, Fast R-CNN learns a convolutional feature map from the entire image before extracting features to classify each proposal, and Faster R-CNN combines a Region Proposal Network (RPN) and Fast R-CNN with shared convolutional layers. Both variants bring compelling gains in accuracy and efficiency for object detection. However, existing R-CNN based methods make predictions for each proposal independently, although surrounding proposals of the same object can provide useful information for refining the proposal location to better cover the object. Moreover, they do not consider segmentation cues, which are beneficial for better localizing objects. In this paper, we aim to further enhance object detection by adopting two strategies, i.e., multi-stage network cascades and group recursive learning for detection refinement.
Multi-Stage Network Cascades
Object detection aims to tightly localize objects of particular categories in an image, while semantic segmentation aims to predict the category label for every pixel of the image. Although the two tasks are typically addressed separately, we argue that the features learned for semantic segmentation can provide valuable cues for localizing objects more accurately, especially small or occluded ones. Therefore, we propose a multi-stage network cascades architecture to jointly perform weakly supervised semantic segmentation and object detection. The proposed architecture consists of three cascaded networks. The first network performs weakly supervised segmentation and learns semantic segmentation features from the entire image. The second network generates object proposals by considering both the convolutional features and the produced segmentation features. Better object proposals can thus be generated, as foreground and background can be better distinguished using the segmentation-related cues. Since there exist large variations in the initial locations of the produced proposals, it is usually hard to make precise predictions for some of the proposals independently in a single step. Thus the third network refines detections recursively, based on the object proposals produced in the previous stage and the global dependency among multiple proposals. In this cascaded manner, the underlying segments from the semantic segmentation task, which provide local cues for better localization, are inherently integrated into object proposal generation and bounding box prediction. Moreover, precise predictions can be progressively obtained through recursive refinement using the global cues from multiple proposals.
Group Recursive Learning
Most existing approaches for object detection perform category prediction for each object proposal independently, without considering the proposals in its vicinity. However, the mutual information among a group of neighboring proposals is quite valuable for obtaining more accurate detection results. As illustrated in Figure 1, although all of the object proposals have a large overlap with the ground truth, their locations relative to the ground truth and the semantic regions they cover differ significantly. Some of the proposals are distant from the ground truth, and it is difficult for the network to make precise predictions independently from such rough locations. One can observe that, for a specific proposal, its surrounding proposals cover different parts of the object. They can provide useful cues for refining the proposal so that it concentrates better around the actual object of interest.
Following the above intuition, we propose a group recursive learning approach to progressively refine object detection results in an expectation-maximization like manner. More concretely, in the E-step, given initial detection results, our approach further refines each proposal by taking into consideration the surrounding proposals that have a large overlap with the proposal of interest. Together, these proposals form a group. All the proposals within the same group collectively refine the proposal of interest to a more precise location. In the M-step, the likelihood of proposals being close to the corresponding ground truth bounding boxes is maximized through the learning process, which provides more precise location predictions. This recursive learning procedure is performed iteratively until the predictions stop improving.
To summarize, we make the following contributions. (1) We develop a unified multi-stage network cascades architecture that is capable of leveraging semantic segmentation features for object detection. (2) We introduce an EM-like group recursive learning approach that iteratively refines detection results and minimizes the offsets between object proposals and the ground truth step by step, considering the global dependency among multiple proposals. (3) Our detection architecture achieves competitive mAPs on the VOC2007 and VOC2012 detection challenges, significantly outperforming many well-established baselines.
II Related Work
In recent years, several works have proposed incorporating segmentation techniques to assist object detection in different ways. For example, Parkhi et al. improved the predicted bounding box with color models from predicted rectangles on cat and dog faces. Dai et al. proposed using segments extracted for each object detection hypothesis to accurately localize detected objects. Other research has exploited segmentation to generate object detection hypotheses for better localization. Segmentation has been adopted as a selective search strategy to generate the best locations for object recognition. Arbelaez et al. proposed a hierarchical segmenter that leverages multiscale information and a grouping algorithm to produce accurate object candidates. Instead of using segmentation to better localize detections, Fidler et al. took advantage of semantic segmentation results to score detections more accurately. In this work, we propose a unified framework that incorporates semantic segmentation features both for object proposal generation and for better scoring and localizing detections. In addition, a group recursive learning strategy is employed to recursively refine the scores and locations of the detections, thus achieving more precise predictions.
III Overview on Multi-stage Object Detection Architecture
Our proposed object detection architecture consists of a cascade of multiple CNN networks, each of which focuses on a specific task, i.e., weakly-supervised semantic segmentation, proposal generation and recursive detection refinement, respectively. The three networks share convolutional features learned from the entire image. Details about the proposed architecture are shown in Figure 2. The input image first passes through several convolutional and max pooling layers to produce convolutional feature maps. Then the semantic segmentation network learns semantic segmentation features for the entire image from the convolutional feature maps. The produced features are then fed into the proposal generation network to generate candidate object proposals. Finally, the recursive detection network iteratively refines the scores and locations of the generated object proposals via a group recursive learning strategy. In the following subsections, we explain the multi-stage network cascades, the group recursive learning scheme and the testing phase in more detail.
III-A Multi-Stage Network Cascades
Object detection and semantic segmentation are two closely related tasks. The segments extracted for each object proposal can provide useful local cues (e.g., object boundaries) for better object localization. To incorporate these semantic segmentation cues, we introduce the multi-stage network cascades architecture, which jointly performs weakly-supervised semantic segmentation and object detection and thereby learns better image representations for object detection.
III-A1 Weakly-Supervised Semantic Segmentation Network
For the semantic segmentation network, we adopt the semantic segmentation-aware CNN model, which is trained for the class-specific foreground segmentation task based on a Fully Convolutional Network. To avoid using additional segmentation annotations, the network is trained to predict class-specific foreground probabilities in a weakly supervised manner, using only the bounding box annotations provided for the detection task. Artificial class-specific foreground segmentation masks are created from the bounding box annotations. Specifically, the ground truth bounding boxes of an image are projected onto the last hidden layer of the Fully Convolutional Network. The “pixels” inside the projected boxes are labeled as foreground while the rest are labeled as background. This process is performed independently for each class. After the Fully Convolutional Network has been trained on the class-specific foreground segmentation task, we drop the last classification layer and extract the convolutional feature maps output by the last convolutional layer as semantic segmentation features for the input images.
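The mask construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `make_foreground_masks`, the corner-format boxes and the down-sampling factor `stride=16` are all assumptions made for the example.

```python
import numpy as np


def make_foreground_masks(boxes, labels, num_classes, image_size, stride=16):
    """Build one binary foreground mask per class from bounding boxes.

    boxes: list of (x1, y1, x2, y2) in image pixels (assumed corner format).
    labels: list of class indices in [0, num_classes).
    image_size: (height, width) of the input image in pixels.
    stride: assumed down-sampling factor between the image and the
        feature map on which the masks are defined (hypothetical value).
    """
    height, width = image_size
    fh = int(np.ceil(height / stride))
    fw = int(np.ceil(width / stride))
    masks = np.zeros((num_classes, fh, fw), dtype=np.uint8)
    for (x1, y1, x2, y2), cls in zip(boxes, labels):
        # Project the box onto the feature-map grid.
        fx1 = int(np.floor(x1 / stride))
        fy1 = int(np.floor(y1 / stride))
        # Keep at least one cell per box and stay inside the map.
        fx2 = min(max(int(np.ceil(x2 / stride)), fx1 + 1), fw)
        fy2 = min(max(int(np.ceil(y2 / stride)), fy1 + 1), fh)
        # Cells inside the projected box are foreground for this class only;
        # every other cell keeps the background label 0.
        masks[cls, fy1:fy2, fx1:fx2] = 1
    return masks
```

Each class gets its own mask, so a cell covered by a box of one class remains background in the masks of all other classes, matching the per-class independence stated in the text.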
III-A2 Proposal Generation Network
Based on the computed feature maps of the input image, the proposal generation network aims to produce a set of object proposals, each of which has a predicted objectness score. Following the Region Proposal Network (RPN), the proposal generation network is structured as a convolutional layer followed by a box-regression layer and a box-classification layer. Unlike RPN, we incorporate the features learned from the semantic segmentation task, which provide better local cues for objectness prediction and proposal localization. Specifically, we concatenate the semantic segmentation feature maps produced by the semantic segmentation network and the last shared convolutional feature maps along the channel axis, forming the input of the proposal generation network. We optimize the network parameters by minimizing a multi-task objective function.
III-A3 Recursive Detection Network
The structure of the recursive detection network is based on the VGG-16 model , which aims to score the input object proposals and refine their bounding box locations following the Fast R-CNN detection pipeline . Different from Fast R-CNN, segmentation-aware features are constructed to incorporate guidance from the pixel-wise segmentation information which can help better depict the boundaries of the objects to facilitate detection. Specifically, the recursive detection network first utilizes an ROI pooling layer to generate a fixed-length feature descriptor of size from both the semantic segmentation feature maps and the last shared convolutional feature maps for each proposal provided by the proposal generation network. Then, following the feature combination scheme adopted in , we concatenate each pooled feature descriptor along the channel axis and reduce the dimension with a convolution to match the shape of required by the first fully-connected layer (fc6) of the pre-trained VGG-16 model. To match the original amplitudes, each pooled feature map is L2 normalized and re-scaled back up by a fixed scale of . The generated feature is then fed into two fully-connected layers (fc6 and fc7) to predict the confidences over categories, including object classes and one background class, as well as the bounding-box regression offsets. The parameters of these predictors are optimized by minimizing soft-max loss and smooth L1 loss .
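The feature combination step can be illustrated with a small NumPy sketch. Several details here are assumptions rather than values from the text: the function names, the choice to compute the L2 norm over all entries of a pooled descriptor, and the `scale` argument, which is left as a free parameter.

```python
import numpy as np


def l2_normalize_and_scale(feat, scale, eps=1e-12):
    """L2-normalize a pooled descriptor of shape (C, H, W), then rescale it.

    Assumption: the norm is taken over all entries of the descriptor.
    """
    return scale * feat / (np.linalg.norm(feat) + eps)


def combine_pooled_features(conv_feat, seg_feat, weight, scale):
    """Concatenate two pooled descriptors along channels and reduce them.

    conv_feat: (C_conv, H, W) descriptor pooled from the shared conv features.
    seg_feat: (C_seg, H, W) descriptor pooled from the segmentation features.
    weight: (C_out, C_conv + C_seg) kernel of a 1x1 convolution (hypothetical).
    """
    stacked = np.concatenate(
        [l2_normalize_and_scale(conv_feat, scale),
         l2_normalize_and_scale(seg_feat, scale)],
        axis=0)
    # A 1x1 convolution is a linear map over channels at every location.
    return np.einsum("oc,chw->ohw", weight, stacked)
```

Normalizing before concatenation keeps one feature source from dominating the other when their raw amplitudes differ, which is the stated reason for the rescaling.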
III-B Group Recursive Learning: An Expectation-Maximization Perspective
Group recursive learning works in an expectation-maximization like way, in which network parameter learning and group recursive refinement are performed alternately. In particular, in the maximization step, the network is trained to minimize the loss function, or equivalently to maximize the likelihood of multiple object bounding box predictions. In the expectation step, the locations of the proposals are refined with the induced group information. We now provide more details about the EM-like group recursive learning.
III-B1 The M-Step: Mini-Batch Gradient Descent
Specifically, the initial object proposal is denoted as where specifies its pixel coordinates of the center and its width and height in pixels . Each ground-truth bounding box is specified in the same way: . The bounding box regression targets are computed as following the transformation strategy adopted in , in which specifies a scale-invariant translation and log-space height/width shift relative to an object proposal. In the -th iteration, the network takes the refined bounding boxes produced in the -th iteration as input, and predicts bounding-box regression offsets, for each of the object classes, indexed by , and the category-level confidences for categories. Each training proposal is labeled with a ground-truth class and a ground-truth bounding-box regression target . We use a multi-task loss on each object proposal to jointly train for classification and bounding-box regression:
where and are the losses for the classification and the bounding-box regression, respectively. In particular, is log loss for the ground truth class g and is a smooth loss proposed in . The Iverson bracket indicator function equals 1 when and 0 otherwise. For background proposals (i.e. ), the is ignored. After the training process, the loss in the -th iteration will be minimized and the likelihood of the regressed proposals being near to the corresponding ground truth is maximized.
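For concreteness, the regression targets and the multi-task loss of Eqn. (1) can be written out as below. This is a sketch in the notation of the Fast R-CNN formulation that the text follows. The symbols $P$ (proposal), $G$ (ground-truth box), $t^*$ (regression target), $t^g$ (predicted offsets for class $g$), $p$ (class confidences), $L_{cls}$ and $L_{loc}$ are assumptions chosen for illustration, and no balancing weight between the two loss terms is assumed.

```latex
\begin{aligned}
t^*_x &= (G_x - P_x)/P_w, \qquad t^*_y = (G_y - P_y)/P_h,\\
t^*_w &= \log(G_w/P_w), \qquad\;\; t^*_h = \log(G_h/P_h),\\[4pt]
L(p, g, t^{g}, t^*) &= L_{cls}(p, g) + [g \ge 1]\, L_{loc}(t^{g}, t^*),\\
L_{cls}(p, g) &= -\log p_g,\\
L_{loc}(t^{g}, t^*) &= \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\!\left(t^{g}_i - t^*_i\right),\\
\mathrm{smooth}_{L_1}(z) &=
\begin{cases}
0.5\, z^2 & \text{if } |z| < 1,\\
|z| - 0.5 & \text{otherwise.}
\end{cases}
\end{aligned}
```

Under this convention the background class has index $g = 0$, so the indicator $[g \ge 1]$ switches off the localization term for background proposals.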
III-B2 The E-Step: Group Confidence Pooling
The regressed bounding box of the proposal can be computed as , where represents the inverse operation of . The final bounding box coordinates are further refined by considering the locations of all the surrounding proposals at different parts of the same object through a group confidence pooling scheme. Specifically, for a specific refined proposal , denote as the set of proposals of the same class that have an overlap with of more than 0.7 on IOU metric. The refined location of can be taken as the expected location of the group by regarding the confidence score of each proposal as a weight:
With this group confidence pooling scheme, the proposals will be refined to a better location by taking the surrounding proposals into consideration. The better localized proposals will be given higher confidence scores. As a result, both loss terms in Eqn. (1) will be reduced.
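The E-step can be sketched in a few lines of NumPy: decode the regressed box, collect the same-class proposals whose IoU with it exceeds 0.7, and take their confidence-weighted mean. The function names, the `(cx, cy, w, h)` box layout and the choice to include the proposal itself in its own group are assumptions made for this illustration.

```python
import numpy as np


def decode(proposal, offsets):
    """Apply (tx, ty, tw, th) offsets to a (cx, cy, w, h) proposal."""
    cx, cy, w, h = proposal
    tx, ty, tw, th = offsets
    return np.array([cx + tx * w, cy + ty * h, w * np.exp(tw), h * np.exp(th)])


def iou(a, b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax1, ay1 = a[0] - a[2] / 2.0, a[1] - a[3] / 2.0
    ax2, ay2 = a[0] + a[2] / 2.0, a[1] + a[3] / 2.0
    bx1, by1 = b[0] - b[2] / 2.0, b[1] - b[3] / 2.0
    bx2, by2 = b[0] + b[2] / 2.0, b[1] + b[3] / 2.0
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def group_confidence_pooling(boxes, scores, iou_thresh=0.7):
    """Replace each box by the confidence-weighted mean of its group.

    boxes: (N, 4) regressed boxes of ONE class, as (cx, cy, w, h).
    scores: (N,) confidence of each box for that class.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    pooled = boxes.copy()
    for i in range(len(boxes)):
        # A box has IoU 1 with itself, so it always belongs to its own group.
        in_group = np.array([iou(boxes[i], other) > iou_thresh for other in boxes])
        weights = scores[in_group]
        if weights.sum() > 0:
            pooled[i] = weights @ boxes[in_group] / weights.sum()
    return pooled
```

Because the weighted mean is linear, averaging in center/size coordinates gives the same box as averaging the corner coordinates.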
Both the M-step and the E-step optimization can be realized within an end-to-end framework. Assume that the total number of refinement iterations is . During the optimization, we unroll the detection network by stacking detection networks with shared parameters. The global loss is computed as
where (ref. Eqn. (1)) represents the loss produced by the recursive detection network at the -th iteration with refined proposals and denotes the loss output by the proposal generation network following the multi-task loss in . Thus the multi-stage network cascades with group recursive learning can be trained end-to-end jointly.
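One way to write the global loss consistent with this description is given below. The symbols $T$ (number of refinement iterations), $L^{(t)}$ (detection loss of Eqn. (1) at iteration $t$) and $L_{pgn}$ (proposal generation loss) are assumed names, and the unweighted sum is an assumption, since any balancing weights are not stated here.

```latex
L_{global} = L_{pgn} + \sum_{t=1}^{T} L^{(t)}
```

Since the $T$ stacked detection networks share parameters, every term $L^{(t)}$ contributes gradients to the same weights.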
In testing, given an input image, the proposed framework first generates initial object proposals using the proposal generation network and then recursively passes them into the recursive detection network. At the -th iteration, the recursive detection network predicts the category-level confidences and bounding-box regression offsets for each proposal. The category of the proposal is predicted as the class with the maximum score in . For the proposals predicted as a specific object class, the locations of the proposals are updated by refining the previous location with the predicted bounding-box regression offsets and then performing the group confidence pooling scheme as previously mentioned. For the proposals predicted as the background class, the locations of the proposals are not updated. The final outputs for each proposal are the results in the last iteration , including the predicted category-level confidences and the refined locations .
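The test-time procedure can be sketched as a short loop. This is a hedged illustration: `recursive_detect`, `predict` and `pool` are hypothetical names, `predict` stands in for a forward pass of the recursive detection network that returns already-decoded class-specific boxes, and `pool` stands in for group confidence pooling.

```python
import numpy as np


def recursive_detect(proposals, predict, pool, num_iters, background=0):
    """Iteratively rescore and relocate proposals at test time.

    proposals: (N, 4) initial boxes from the proposal generation network.
    predict: callable, boxes -> (scores (N, C), regressed (N, C, 4)); the
        regressed boxes are assumed to be decoded per class already.
    pool: callable, (boxes (M, 4), scores (M,)) -> (M, 4) pooled boxes.
    num_iters: number of refinement iterations.
    background: index of the background class (assumed to be 0).
    """
    boxes = np.asarray(proposals, dtype=float).copy()
    scores = None
    for _ in range(num_iters):
        scores, regressed = predict(boxes)
        labels = scores.argmax(axis=1)
        for cls in np.unique(labels):
            if cls == background:
                # Background proposals keep their previous location.
                continue
            idx = np.where(labels == cls)[0]
            # Regress first, then pool within the predicted class.
            boxes[idx] = pool(regressed[idx, cls], scores[idx, cls])
    return boxes, scores
```

The returned boxes and scores are those of the last iteration, as in the description above.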
| Method | P D R | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
| Method | P D R | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
IV-A Experimental Settings
Datasets and Evaluation Metrics
The PASCAL VOC 2007 and VOC 2012 datasets consist of 9,963 and 22,531 images respectively, and each is divided into train, val and test subsets. The model evaluated on VOC 2007 is trained on the trainval split of VOC 2007 (5,011 images) and the trainval split of VOC 2012 (11,540 images). The model evaluated on VOC 2012 is trained on all 9,963 images of VOC 2007 and the trainval split of VOC 2012. Following the PASCAL challenge protocols, we use the standard evaluation metrics Average Precision (AP) and mean AP (mAP).
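As a reminder of how these metrics are computed, the sketch below implements the 11-point interpolated AP used by the VOC 2007 protocol, and mAP as the mean over classes. The function names and input layout are assumptions, and matching detections to ground truth (the `is_true_positive` flags) is taken as already done.

```python
import numpy as np


def average_precision_11pt(scores, is_true_positive, num_gt):
    """11-point interpolated AP for one class.

    scores: confidence of each detection.
    is_true_positive: 1 if the detection matches an unmatched ground truth.
    num_gt: number of ground-truth objects of this class (must be > 0).
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    ap = 0.0
    for i in range(11):
        r = i / 10.0
        # Interpolated precision: best precision at any recall >= r.
        mask = recall >= r - 1e-12
        ap += precision[mask].max() if mask.any() else 0.0
    return ap / 11.0


def mean_average_precision(per_class_ap):
    """mAP is the unweighted mean of the per-class APs."""
    return float(np.mean(per_class_ap))
```

For example, three detections ranked true, false, true against two ground-truth objects give an interpolated precision of 1 at the six recall levels up to 0.5 and 2/3 at the remaining five.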
We initialize the bottom shared convolutional layers and the recursive detection network with the pre-trained VGG-16 model, and initialize the semantic segmentation network with the pre-trained semantic segmentation-aware CNN model. All other newly added layers are initialized by drawing weights from zero-mean Gaussian distributions with standard deviations of 0.01 and 0.001. Our code is based on the publicly available Faster R-CNN framework built on the Caffe platform. We fine-tune the whole framework jointly, following a previously proposed fine-tuning strategy. During fine-tuning, each image is horizontally flipped with a probability of 0.5 to augment the training data. We set the number of iterations for group recursive learning to 2, since only minor improvement is observed with more iterations. We train the network parameters with Stochastic Gradient Descent (SGD) for both VOC 2007 and VOC 2012. The initial learning rate of all layers is set to 0.001 and is later decreased to one tenth of its current value. The model is trained on an NVIDIA GeForce Titan X GPU and an Intel Core i7-4930K CPU @ 3.40 GHz.
IV-B Performance Comparisons
Table I and Table II compare the proposed framework with several state-of-the-art methods. On VOC 2007, our method outperforms the two baselines of Girshick et al. and Ren et al. in mAP. On VOC 2012, it again outperforms both baselines. In general, the proposed method shows significantly higher performance than the baselines and achieves competitive results compared with the state-of-the-art methods on both datasets, which validates the benefit of the multi-stage network cascades framework and the group recursive learning strategy for accurate object detection.
IV-C Ablation Studies
We further evaluate two important components, i.e. multi-stage network cascades and group recursive learning, to validate their effectiveness.
Multi-stage Network Cascades
We verify the effectiveness of incorporating semantic segmentation features for better object proposal generation and detection using the multi-stage network cascades framework. As shown in Table I, incorporating the semantic segmentation features into the proposal generation network yields an improvement over the variant without semantic segmentation features, in which object proposals and detection results are generated directly from the last shared convolutional features. Similarly, incorporating the semantic segmentation features into the object detection network offers a further performance increase. This demonstrates that the proposed multi-stage network cascades framework can effectively leverage the features learned from the semantic segmentation task for object detection, which leads to more accurate bounding boxes for object proposals and provides useful local cues for better object classification and localization.
Group Recursive Learning
In the proposed method, we set the maximal number of iterations for group recursive learning to 2. To verify the effectiveness of the proposed group recursive learning scheme, we evaluate the performance of the framework with different numbers of iterations during the training and testing stages. In Table III, “Iter_1” denotes the variant without any recursive refinement, in which detection results are generated with only one iteration, and “Iter_2” denotes the model using two iterations. Compared with “Iter_1”, “Iter_2” improves the performance, which verifies that more precise detection results can be obtained from the recursively refined bounding box locations and classification scores. Since no noticeable improvement can be observed by adding more iterations, we use two iterations for group recursive learning throughout our experiments.
To verify the advantage of using the group recursive learning scheme in both the training and testing stages, we evaluate a variant in which the recursive process is performed only during the testing stage, denoted as “Iter_2_testing”. As shown in Table III, “Iter_2_testing” performs worse than “Iter_2”, demonstrating that employing group recursive refinement during both training and testing is beneficial for jointly improving the network capabilities.
IV-D Detection Error Analysis
We analyze the detection errors of the proposed method using the tool of Hoiem et al. In Figure 3, we plot pie charts showing the percentage of detections that are false positives due to bad localization, confusion with similar categories, confusion with other categories, and confusion with background or unlabeled objects. It can be observed that the proposed framework achieves a considerable reduction in the percentage of false positives due to bad localization for challenging categories. This validates that incorporating semantic segmentation features can increase the localization sensitivity of the detection network, and that precise bounding boxes for the detections can be obtained with the proposed group recursive learning scheme. A similar observation can be drawn from Figure 4, where we plot the top-ranked false positive types of the baseline and of the proposed framework.
IV-E Qualitative Results
In Figure 5, we provide sample qualitative results that present the iterative bounding box location refinement procedure starting from an initial object proposal produced by the proposal generation network. This example shows that our proposed method is capable of refining the produced initial object proposals step by step to fit them to the ground-truth bounding boxes of different objects, providing accurate object localization.
In this paper, we propose a multi-stage network cascades framework with group recursive learning for object detection. Specifically, the proposed framework effectively utilizes semantic segmentation features to assist object detection by incorporating the semantic segmentation network, the proposal generation network and the recursive detection network into a unified architecture. In addition, a group recursive learning scheme is proposed to recursively score object proposals and regress their bounding boxes, considering the locations of the surrounding proposals of the same object. We show that the proposed framework is particularly effective in object localization and achieves competitive results on PASCAL VOC 2007 and 2012.
- [1] Pablo Arbelaez, Jordi Pont-Tuset, Jon Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In CVPR, pages 328–335, 2014.
- [2] Sean Bell, C Lawrence Zitnick, Kavita Bala, and Ross Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. arXiv preprint arXiv:1512.04143, 2015.
- [3] Joao Carreira, Rui Caseiro, Jorge Batista, and Cristian Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, pages 430–443, 2012.
- [4] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3D object proposals for accurate object class detection. In NIPS, pages 424–432, 2015.
- [5] Qieyun Dai and Derek Hoiem. Learning to localize detected objects. In CVPR, pages 3322–3329, 2012.
- [6] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
- [7] Sanja Fidler, Sven Dickinson, and Raquel Urtasun. 3D object detection and viewpoint estimation with a deformable 3D cuboid model. In NIPS, pages 611–619, 2012.
- [8] Sanja Fidler, Roozbeh Mottaghi, Alan Yuille, and Raquel Urtasun. Bottom-up segmentation for top-down detection. In CVPR, pages 3294–3301, 2013.
- [9] Spyros Gidaris and Nikos Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. In ICCV, 2015.
- [10] Ross Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
- [11] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
- [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- [13] Judy Hoffman, Sergio Guadarrama, Eric S Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, and Kate Saenko. LSDA: Large scale detection through adaptation. In NIPS, pages 3536–3544, 2014.
- [14] Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai. Diagnosing error in object detectors. In ECCV, pages 340–353, 2012.
- [15] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678, 2014.
- [16] Peter Kontschieder, Samuel R. Bulò, Antonio Criminisi, Pushmeet Kohli, Marcello Pelillo, and Horst Bischof. Context-sensitive decision forests for object detection. In NIPS, pages 431–439, 2012.
- [17] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
- [18] Omkar M Parkhi, Andrea Vedaldi, CV Jawahar, and Andrew Zisserman. The truth about cats and dogs. In ICCV, pages 1427–1434, 2011.
- [19] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2015.
- [20] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
- [21] Mohammad Saberian and Nuno Vasconcelos. Multi-resolution cascades for multiclass object detection. In NIPS, pages 2186–2194, 2014.
- [22] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [23] Christian Szegedy, Alexander Toshev, and Dumitru Erhan. Deep neural networks for object detection. In NIPS, pages 2553–2561, 2013.
- [24] Jasper RR Uijlings, Koen EA van de Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. IJCV, 104(2):154–171, 2013.
- [25] Koen EA Van de Sande, Jasper RR Uijlings, Theo Gevers, and Arnold WM Smeulders. Segmentation as selective search for object recognition. In ICCV, pages 1879–1886, 2011.
- [26] Xiaolong Wang and Liang Lin. Dynamical And-Or graph learning for object shape modeling and detection. In NIPS, pages 242–250, 2012.
- [27] C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In ECCV, pages 391–405, 2014.