Object detection in images is one of the most widely explored tasks in computer vision he2017mask ; he2016deep . Existing deep learning approaches to this task (e.g., R-CNN girshick2014rich and its variants girshick2015fast ; ren2015fasterrcnn ; he2017mask ) mainly rely on region proposal mechanisms (e.g., region proposal networks (RPNs)) to generate potential bounding boxes in an image and then classify these bounding boxes to achieve object detection. Although such mechanisms generally achieve good detection performance under normal circumstances, their recall in scenes with extreme cases (e.g., complex occlusion (Fig. 1(a)), poor illumination (Fig. 1(b)), and large-scale small objects (Fig. 1(c))) is unacceptably low.
Specifically, detecting objects under extreme cases via region proposal mechanisms encounters two challenges: First, the performance of region proposal mechanisms highly depends on the purity of bounding boxes guo2016deep ; however, the annotated bounding boxes in extreme cases usually contain much more environment noise than those in normal cases. This inevitably increases the difficulty of model learning and decreases the resulting confidence scores of bounding boxes, which consequently weakens the detection performance. Second, non-maximum suppression (NMS) operations are used in region proposal mechanisms to select target boxes by setting an intersection over union (IoU) threshold to filter other bounding boxes. However, it is very hard (and sometimes even impossible) to find an appropriate threshold to adapt to the very complex situations in extreme cases.
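To make the threshold sensitivity of NMS concrete, the following is a minimal sketch of greedy NMS on two heavily overlapping boxes, as might occur under occlusion. The function names `iou` and `nms`, the box format `(x0, y0, x1, y1)`, and the example values are assumptions chosen for illustration; they are not taken from any specific detector.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold):
    """Greedy NMS: keep a box only if its IoU with every kept box is <= threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep


# Two distinct but occluding objects (hypothetical values): IoU = 70 / 130.
boxes = [(0, 0, 10, 10), (3, 0, 13, 10)]
scores = [0.9, 0.8]
print(nms(boxes, scores, 0.5))  # the second object is suppressed
print(nms(boxes, scores, 0.6))  # both objects survive
```

A threshold that keeps both objects here would also keep duplicate detections of a single object elsewhere, which is the difficulty described above.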
Motivated by this, in this work, we propose a weakly supervised multimodal annotation segmentation (WSMA-Seg) approach, which uses segmentation models to achieve an accurate and robust object detection without NMS. It consists of two phases, namely, a training and a testing phase. In the training phase, WSMA-Seg first converts weakly supervised bounding box annotations in detection tasks to multi-channel segmentation-like masks, called multimodal annotations; then, a segmentation model is trained using multimodal annotations as labels to learn multimodal heatmaps for the training images. In the testing phase, the resulting heatmaps of a given test image are converted into an instance-aware segmentation map based on a pixel-level logic operation; then, a contour tracing operation is conducted to generate contours for objects using the segmentation map; finally, bounding boxes of objects are created as circumscribed quadrilaterals of their corresponding contours.
WSMA-Seg has the following advantages: (i) as an NMS-free solution, WSMA-Seg avoids all hyperparameters related to anchor boxes and NMS; so, the above-mentioned threshold selection problem is also avoided; (ii) the complex occlusion problem can be alleviated by utilizing the topological structure of segmentation-like multimodal annotations; and (iii) multimodal annotations are pixel-level annotations; so, they can describe the objects more accurately and overcome the above-mentioned environment noise problem.
Furthermore, it is obvious that the performance of the proposed WSMA-Seg approach greatly depends on the segmentation performance of the underlying segmentation model. Therefore, in this work, we further propose a multi-scale pooling segmentation (MSP-Seg) model, which is used as the underlying segmentation model of WSMA-Seg to achieve a more accurate segmentation (especially for extreme cases, e.g., very small objects), and consequently enhances the detection accuracy of WSMA-Seg.
The contributions of this paper are briefly as follows:
We propose a weakly supervised multimodal annotation segmentation (WSMA-Seg) approach to achieve an accurate and robust object detection without NMS, which is the first anchor-free and NMS-free object detection approach.
We propose multimodal annotations to achieve an instance-aware segmentation using weakly supervised bounding boxes; we also develop a run-data-based following algorithm to trace contours of objects.
We propose a multi-scale pooling segmentation (MSP-Seg) model to achieve a more accurate segmentation and to enhance the detection accuracy of WSMA-Seg.
We have conducted extensive experimental studies on the Rebar Head, WIDER Face, and MS COCO datasets; the results show that the proposed WSMA-Seg approach outperforms the state-of-the-art detectors on all testing datasets.
2 Weakly Supervised Multimodal Annotation Segmentation
In this section, we introduce our approach to object detection using weakly supervised multimodal annotation segmentation (WSMA-Seg). WSMA-Seg generally consists of two phases: a training phase and a testing phase. In the training phase, as shown in Figure 2, WSMA-Seg first converts the weakly supervised bounding box annotations to pixel-level segmentation-like masks with three channels, representing interior, boundary, and boundary on interior masking information, respectively; the resulting annotations are called multimodal annotations; then, multimodal annotations are used as labels to train an underlying segmentation model to learn corresponding multimodal heatmaps for the training images. In the testing phase, as shown in Figure 3, we first send the given testing image into the well-trained segmentation model to obtain multimodal heatmaps; then, the resulting three heatmaps are converted into an instance-aware segmentation map based on a pixel-level logic operation; finally, a contour tracing operation is conducted to generate contours for objects using the segmentation map, and the bounding boxes of objects are created as circumscribed quadrilaterals of their contours. The rest of this section will introduce the main ingredients of WSMA-Seg.
2.1 Generating Multimodal Annotations
Pixel-level segmentation annotations are much more representative than bounding box annotations, so they can resolve some extreme cases that are challenging for bounding box annotations. However, creating well-designed pixel-level segmentation masks is very time-consuming, which is about times of creating bounding box annotations lin2014microsoft . Therefore, in this work, we propose a methodology to automatically convert bounding box annotations to segmentation-like multimodal annotations, which are pixel-level geometric segmentation-like multichannel annotations. Here, “geometric segmentation-like” means that the multimodal annotations are not strict segmentation annotations; rather, they are annotations generated from simple geometries, e.g., inscribed ellipses of bounding boxes. This is motivated by the finding in dai2015boxsup that pixel-level segmentation information is not fully utilized by segmentation models; we thus believe that well-designed pixel-level segmentation annotations may not be essential to achieve a reasonable performance; rather, pixel-level geometric annotations should be enough. Furthermore, to generate a bounding box for each object in the image, an instance-aware segmentation is required; to achieve this, multimodal annotations are designed to have multiple channels to introduce additional information.
Specifically, as shown in Figure 2, multimodal annotations use three channels to represent pixel-level masking information regarding the interior, the boundary, and the boundary on the interior of geometries. These three pixel-level masks are generated as follows. Given an image with bounding box annotations, we first obtain an inscribed ellipse for each bounding box. The interior mask (the first channel) is then obtained by marking the pixels on the edge of or inside the ellipses as foreground and all other pixels as background. The boundary mask (the second channel) is obtained by marking the pixels on the edge of the ellipses, or within a fixed inner width of that edge, as foreground and the rest as background. Similarly, the boundary on the interior mask (the third channel) is generated by marking as foreground the pixels on the edge of, or within the same inner width of, the area where ellipses overlap.
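The construction above can be sketched in plain Python as follows. This is an illustrative interpretation rather than the authors' implementation: the names `inside` and `multimodal_annotation`, the parameter `band` (standing in for the inner width), the binary values 1 and 0, and the approximation of the overlap boundary as "inside at least two ellipses and within the band of one of them" are all assumptions.

```python
def inside(x, y, box, shrink=0.0):
    """True if pixel (x, y) lies in the ellipse inscribed in box = (x0, y0, x1, y1),
    with both semi-axes reduced by `shrink` pixels."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    a, b = (x1 - x0) / 2.0 - shrink, (y1 - y0) / 2.0 - shrink
    if a <= 0 or b <= 0:
        return False
    return ((x - cx) / a) ** 2 + ((y - cy) / b) ** 2 <= 1.0


def multimodal_annotation(height, width, boxes, band=1):
    """Return (interior, boundary, overlap_boundary) binary masks as nested lists."""
    interior = [[0] * width for _ in range(height)]
    boundary = [[0] * width for _ in range(height)]
    overlap = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            containing = [b for b in boxes if inside(x, y, b)]
            if not containing:
                continue
            interior[y][x] = 1
            # In the band: inside an ellipse but outside its shrunken copy.
            in_band = any(not inside(x, y, b, shrink=band) for b in containing)
            if in_band:
                boundary[y][x] = 1
                if len(containing) >= 2:  # assumed definition of the overlap boundary
                    overlap[y][x] = 1
    return interior, boundary, overlap


# Two overlapping boxes (hypothetical values).
I, B, O = multimodal_annotation(9, 13, [(0, 0, 8, 8), (4, 0, 12, 8)], band=1)
```

Because the masks depend only on box coordinates, they can be generated on the fly during training without any extra annotation cost.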
2.2 Multi-Scale Pooling Segmentation
It is obvious that the performance of the proposed WSMA-Seg approach greatly depends on the segmentation performance of the underlying segmentation model. Therefore, in this work, we further propose a multi-scale pooling segmentation (MSP-Seg) model, which is used as the underlying segmentation model of WSMA-Seg to achieve a more accurate segmentation (especially for extreme cases, e.g., very small objects), and to consequently enhance the detection accuracy of WSMA-Seg.
As shown in Figure 4, MSP-Seg is an improved variant of the Hourglass segmentation model newell2016stacked . The main improvement of MSP-Seg is a multi-scale block on the skip connections, which performs multi-scale pooling operations on the output feature maps of the residual blocks. Specifically, as shown in Figure 5, multi-scale pooling uses four pooling kernels of different sizes to simultaneously conduct average pooling operations on the feature maps generated by the residual blocks on the skip connections. Then, the four feature maps generated by the different pooling channels are concatenated to form a new feature map whose number of channels is four times that of the previous feature maps. Here, to ensure that the four feature maps have the same size, a common stride is used for all four kernels and zero-padding is conducted. Finally, we apply a convolution to restore the number of channels, and an element-wise addition to merge the feature maps. As shown in Figure 4, by using multimodal annotations as labels, MSP-Seg is trained to learn three heatmaps for each image, which are called the interior heatmap, the boundary heatmap, and the boundary on interior heatmap, respectively.
Intuitively, multi-scale pooling is capable of enhancing the segmentation accuracy, because it combines features of different scales to obtain more representative feature maps. Please note that, as a highly accurate segmentation model, MSP-Seg can be widely applied to various segmentation tasks.
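The multi-scale pooling block can be illustrated on a single-channel feature map with a dependency-free sketch. Several details here are assumptions: the kernel sizes `(1, 3, 5, 7)`, a stride of 1, the reduction of the channel-restoring convolution to a per-pixel weighted sum over the four pooled maps, and the addition of the result to the input map. The names `avg_pool_same` and `multi_scale_pool` are hypothetical.

```python
def avg_pool_same(fmap, k):
    """Average pooling with an odd k x k kernel, stride 1, and zero padding,
    so the output has the same height and width as the input."""
    h, w, r = len(fmap), len(fmap[0]), k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += fmap[yy][xx]
            out[y][x] = total / (k * k)  # padded zeros count toward the average
    return out


def multi_scale_pool(fmap, kernel_sizes=(1, 3, 5, 7), weights=None):
    """Pool at several scales, mix the branches per pixel, and add to the input."""
    pooled = [avg_pool_same(fmap, k) for k in kernel_sizes]  # the "concatenated" branches
    if weights is None:
        weights = [1.0 / len(pooled)] * len(pooled)  # stand-in for learned weights
    h, w = len(fmap), len(fmap[0])
    return [[fmap[y][x] + sum(wt * p[y][x] for wt, p in zip(weights, pooled))
             for x in range(w)] for y in range(h)]


features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
merged = multi_scale_pool(features)
```

In a real network the mixing weights would be learned and the block would operate on many channels; the sketch only shows how same-size outputs from different kernel sizes can be combined.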
2.3 Object Detection Using Segmentation Results and Contour Tracing
After obtaining a well-trained segmentation model, we are now able to conduct object detection. As shown in Figure 3, given a test image as the input of the segmentation model, WSMA-Seg first generates three heatmaps, i.e., the interior, boundary, and boundary on interior heatmaps, which are denoted as I, B, and O, respectively. These three heatmaps are then converted to binary heatmaps, in which the pixels in the regions of interest take the foreground value and all other pixels take the background value. This conversion follows the approach in suzuki1985topological . Furthermore, a pixel-level logic operation on I, B, and O is used to merge the three binary heatmaps into an instance-aware segmentation map.
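The binarization and merging step might look like the sketch below. Both the fixed threshold of 0.5 and the specific logic operation (interior AND NOT (boundary AND boundary on interior)) are assumptions made for illustration, since the exact operation is not recoverable from the text here; the operation is therefore passed in as a replaceable function. All names are hypothetical.

```python
def binarize(heatmap, threshold=0.5):
    """Map each heatmap value to 1 (foreground) or 0 (background)."""
    return [[1 if v >= threshold else 0 for v in row] for row in heatmap]


def assumed_merge_op(i, b, o):
    """Hypothetical pixel-level operation: I AND NOT (B AND O)."""
    return int(bool(i) and not (b and o))


def instance_aware_map(I, B, O, op=assumed_merge_op, threshold=0.5):
    """Binarize the three heatmaps and combine them pixel by pixel with `op`."""
    Ib, Bb, Ob = (binarize(h, threshold) for h in (I, B, O))
    return [[op(i, b, o) for i, b, o in zip(ri, rb, ro)]
            for ri, rb, ro in zip(Ib, Bb, Ob)]


# One-row example with hypothetical heatmap values.
I = [[0.9, 0.9, 0.9, 0.1]]
B = [[0.8, 0.2, 0.7, 0.0]]
O = [[0.1, 0.1, 0.9, 0.0]]
seg = instance_aware_map(I, B, O)
```

Under this assumed operation, a pixel is removed from the interior only where the boundary and boundary on interior heatmaps agree, which separates overlapping instances before contour tracing.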
Finally, a contour tracing operation is conducted to generate contours for objects using the instance-aware segmentation map, and the bounding boxes of objects are created as circumscribed quadrilaterals of their contours. One conventional way to trace a contour is to use the scan-based-following algorithm suzuki1985topological . However, for a large image with many objects (which is common in detection tasks), the scan-based-following algorithm is very time-consuming.
Therefore, motivated by the work in agrawala1977sequential , we propose a modified run-data-based (RDB) following algorithm, which greatly reduces the time and memory costs of the contour tracing operation. Pseudocode of the RDB following algorithm is shown in Algorithm 1, and an example is shown in Figure 6. The pixel-following algorithm has to scan the entire image to find a starting point and then trace contour pixels along the clockwise direction, repeating this process for every contour. In contrast, the RDB following algorithm only needs to keep two lines of pixel values in memory and to scan the whole image once, which significantly reduces the memory consumption and increases the speed.
Specifically, the RDB following algorithm first initializes two variables, one for the left end and one for the right end of a run, with null values, and then scans the binary instance-aware segmentation map row by row from the top-left corner to the bottom-right corner to find contours. If a pixel is foreground and its left neighbor is background, then this pixel is on the left side of a contour, so it is assigned to the left-end variable; similarly, if a pixel is foreground and its right neighbor is background, then this pixel is on the right side of a contour, so it is assigned to the right-end variable. When both ends are found, we check whether there exists a pair of left and right ends on the line above whose x-coordinates are the same as, or differ by a small fixed offset from, the corresponding x-coordinates of the current pair; if so, we add the current pair to the same contour set as that pair; otherwise, we create a new contour set and add the current pair to it.
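The row-by-row scan can be sketched as follows. This is a simplified variant and not a reproduction of Algorithm 1: it links a run to runs on the previous row by horizontal overlap within a tolerance `tol` (instead of comparing both endpoints), and it uses a small union-find structure so that sets meeting lower in the image are merged. The names `find_runs` and `trace_boxes` and the output format `(x_min, y_min, x_max, y_max)` are assumptions.

```python
def find_runs(row):
    """Return (left, right) column indices of each maximal run of 1s in a row."""
    runs, left = [], None
    for x, value in enumerate(row):
        if value and left is None:
            left = x  # foreground pixel whose left neighbor is background
        if value and (x + 1 == len(row) or not row[x + 1]):
            runs.append((left, x))  # foreground pixel whose right neighbor is background
            left = None
    return runs


def trace_boxes(seg_map, tol=1):
    """Scan once, keeping only the previous row's runs, and return one box per set."""
    parent = {}  # union-find over contour-set labels
    boxes = {}   # label -> [x_min, y_min, x_max, y_max]

    def root(label):
        while parent[label] != label:
            label = parent[label]
        return label

    prev = []  # runs on the previous row as (left, right, label)
    for y, row in enumerate(seg_map):
        cur = []
        for left, right in find_runs(row):
            linked = [root(lab) for pl, pr, lab in prev
                      if pl <= right + tol and pr >= left - tol]
            if linked:
                label = linked[0]
                for other in linked[1:]:
                    parent[other] = label  # merge sets joined by this run
            else:
                label = len(parent)  # start a new contour set
                parent[label] = label
                boxes[label] = [left, y, right, y]
            box = boxes[label]
            box[0], box[2], box[3] = min(box[0], left), max(box[2], right), y
            cur.append((left, right, label))
        prev = cur

    merged = {}
    for label, (x0, y0, x1, y1) in boxes.items():
        r = root(label)
        m = merged.get(r, (x0, y0, x1, y1))
        merged[r] = (min(m[0], x0), min(m[1], y0), max(m[2], x1), max(m[3], y1))
    return sorted(merged.values())


seg = [
    [0, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
print(trace_boxes(seg))  # [(0, 0, 3, 2), (5, 1, 6, 2)]
```

Only the runs of the previous row are retained between iterations, which mirrors the two-line memory footprint described above.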
To show the strength of our proposed WSMA-Seg approach in object detection, extensive experimental studies have been conducted on three benchmark datasets, namely, the Rebar Head (https://www.datafountain.cn/competitions/332/details), WIDER Face (http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/), and MS COCO (http://cocodataset.org/) datasets, each of which contains many extreme cases. The important parameters of WSMA-Seg are as follows: Stack is the number of stacked hourglass networks (see newell2016stacked for more details about hourglass); Base is a pre-defined basic number, and the number of channels is always an integer multiple of Base; and Depth is the number of down-samplings. Stem represents three consecutive convolution operations (with a fixed stride) before the first stack.
3.1 Rebar Head Detection
We first conduct experiments on the Rebar Head detection dataset, which consists of training images (including a total of rebar heads) and testing images. The original resolution of the whole image is . Performing object detection on this dataset is very challenging, because it contains only a few training samples and exhibits very severe occlusion (see Figure 7). In addition, the target rebar heads are very small: the average area of each box is pixels, taking up only 0.13% of the whole image. The images are also poorly annotated and vary widely in illumination.
Two state-of-the-art anchor-based models, Faster R-CNN ren2015fasterrcnn and Cascade R-CNN cai2018cascade , are selected as the baselines. Table 1 shows the detection performances of our proposed WSMA-Seg and the baselines on this dataset. As shown in Table 1, our proposed method with , , has achieved the best performance among all solutions in terms of F1 Score. In addition, the number of parameters needed for WSMA-Seg is much smaller than those of the baselines (only of Cascade RCNN and of Faster RCNN), and the number of training epochs for WSMA-Seg is also smaller than those of the baselines. Therefore, we can conclude that, compared to the state-of-the-art baselines, WSMA-Seg is much simpler, more effective, and more efficient.
|Method||Parameters||Epochs||F1 Score|
|Cascade RCNN cai2018cascade||42.1M||100||98.70%|
3.2 WIDER Face Detection
We further conduct experiments on the WIDER Face detection dataset yang2016wider , which consists of images and faces. Face detection on this dataset is extremely challenging due to a high degree of variability in scale, pose, and occlusion, and detectors obtain a much lower accuracy on WIDER Face than on other face detection datasets. WIDER Face defines three levels of difficulty (i.e., Easy, Medium, and Hard), based on the detection accuracies of EdgeBox zitnick2014edge . Furthermore, the dataset treats occlusion as an additional attribute and partitions faces into three categories: no occlusion, partial occlusion, and heavy occlusion. Specifically, a face is categorized as partial occlusion when 1% to 30% of the total face area is occluded, and as heavy occlusion when more than 30% is occluded. The size of the training set is , that of the validation set is , and that of the testing set is .
Twelve state-of-the-art approaches are selected as baselines, namely, Two-stage CNN, Cascade R-CNN, LDCF+ ohn2016boost , multitask Cascade CNN zhang2016joint , ScaleFace yang2017face , MSCNN cai2016unified , HR hu2017finding , Face R-CNN wang2017detecting , Face Attention Networks wang2017face , and PyramidBox tang2018pyramidbox . The experimental results in terms of F1 score are shown in Table 8. The results show that our proposed WSMA-Seg outperforms the state-of-the-art baselines in all three categories, reaching , , and in Easy, Medium, and Hard categories, respectively.
3.3 MS COCO Detection
Finally, we conduct experimental studies on the MS COCO detection dataset lin2014microsoft , which is one of the most popular large-scale detection datasets. Our results are obtained using the test-dev split (20k images) with a host of the detection method. We have constructed the training set with samples, the validation set with samples, and the testing set with samples. We use the metrics defined in lin2014microsoft to characterize the performance. Four types of metrics are defined as follows:
Average Precision (AP):
- $AP$: AP at IoU=.50:.05:.95 (primary challenge metric)
- $AP_{50}$: AP at IoU=.50 (PASCAL VOC metric)
- $AP_{75}$: AP at IoU=.75 (strict metric)
AP Across Scales:
- $AP_{S}$: AP for small objects: area $< 32^2$
- $AP_{M}$: AP for medium objects: $32^2 <$ area $< 96^2$
- $AP_{L}$: AP for large objects: area $> 96^2$
Average Recall (AR):
- $AR_{1}$: AR given 1 detection per image
- $AR_{10}$: AR given 10 detections per image
- $AR_{100}$: AR given 100 detections per image
AR Across Scales:
- $AR_{S}$: AR for small objects: area $< 32^2$
- $AR_{M}$: AR for medium objects: $32^2 <$ area $< 96^2$
- $AR_{L}$: AR for large objects: area $> 96^2$
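As a small illustration of how these metric definitions translate into code, the sketch below lists the ten IoU thresholds behind the primary metric and assigns an object to a size bucket. The area limits of 32² and 96² pixels follow the standard COCO evaluation definitions; the handling of areas exactly on a limit, and the names `IOU_THRESHOLDS`, `size_bucket`, and `primary_ap`, are assumptions of this sketch.

```python
# IoU thresholds .50:.05:.95 used by the primary COCO AP metric.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def size_bucket(area):
    """Classify an object by pixel area into the COCO small/medium/large scales."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def primary_ap(ap_per_threshold):
    """Average the per-threshold AP values (a dict keyed by IoU threshold)."""
    return sum(ap_per_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


print(IOU_THRESHOLDS)
print(size_bucket(20 * 20), size_bucket(50 * 50), size_bucket(200 * 200))
```

Averaging over thresholds rewards detectors whose boxes remain accurate under strict overlap requirements, not only at IoU=.50.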
Seven state-of-the-art solutions are selected as baselines, and the experimental results for the four types of metrics are shown in Tables 2 and 3. The results show that our WSMA-Seg approach outperforms all state-of-the-art baselines in terms of most metrics, including the most challenging ones. For the other metrics, the performance of our proposed approach is also close to that of the best baselines. This indicates that the proposed WSMA-Seg approach, without using NMS, generally achieves more accurate and robust object detection than the state-of-the-art approaches.
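|Method||Backbone||$AP$||$AP_{50}$||$AP_{75}$||$AP_{S}$||$AP_{M}$||$AP_{L}$|
|Faster R-CNN w/ TDM||Inception-ResNet-v2||36.8||57.7||39.2||16.2||39.8||52.1|
|Method||Backbone||$AR_{1}$||$AR_{10}$||$AR_{100}$||$AR_{S}$||$AR_{M}$||$AR_{L}$|
|Faster R-CNN w/ TDM||Inception-ResNet-v2||31.6||49.3||51.9||28.1||56.6||71.1|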
In this work, we have proposed a novel approach to object detection in images, called weakly supervised multimodal annotation segmentation (WSMA-Seg), which is anchor-free and NMS-free. We observed that NMS is one of the bottlenecks of existing deep learning approaches to object detection in images: the need to tune NMS hyperparameters has seriously hindered the scalability of high-performance detection frameworks. Therefore, to realize WSMA-Seg, we proposed to use multimodal annotations to achieve an instance-aware segmentation based on weakly supervised bounding boxes, and developed a run-data-based following algorithm to trace contours of objects. In addition, a multi-scale pooling segmentation (MSP-Seg) model was proposed as the underlying segmentation model of WSMA-Seg to achieve a more accurate segmentation and to enhance the detection accuracy of WSMA-Seg. Experimental results on multiple datasets show that the proposed WSMA-Seg approach is superior to the state-of-the-art detectors.
- (1) K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on Computer Vision, 2017.
- (2) K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2016.
- (3) R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of Computer Vision and Pattern Recognition, 2014.
- (4) R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision, 2015.
- (5) S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 91–99. [Online]. Available: http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf
- (6) Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, “Deep learning for visual understanding: A review,” Neurocomputing, vol. 187, pp. 27–48, 2016.
- (7) T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of European Conference on Computer Vision, 2014.
- (8) J. Dai, K. He, and J. Sun, “Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015.
- (9) A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in Proceedings of European Conference on Computer Vision, 2016.
- (10) S. Suzuki et al., “Topological structural analysis of digitized binary images by border following,” Computer vision, graphics, and image processing, vol. 30, no. 1, pp. 32–46, 1985.
- (11) A. K. Agrawala and A. V. Kulkarni, “A sequential approach to the extraction of shape features,” Computer Graphics and Image Processing, vol. 6, no. 6, pp. 538–557, 1977.
- (12) Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality object detection,” in Proceedings of Computer Vision and Pattern Recognition, 2018.
- (13) S. Yang, P. Luo, C.-C. Loy, and X. Tang, “Wider face: A face detection benchmark,” in Proceedings of the IEEE conference on Computer vision and Pattern Recognition, 2016.
- (14) C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in Proceedings of European conference on computer vision, 2014.
- (15) E. Ohn-Bar and M. M. Trivedi, “To boost or not to boost? on the limits of boosted trees for object detection,” in Proceedings of International Conference on Pattern Recognition, 2016.
- (16) K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
- (17) S. Yang, Y. Xiong, C. C. Loy, and X. Tang, “Face detection through scale-friendly deep convolutional networks,” arXiv preprint arXiv:1706.02863, 2017.
- (18) Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, “A unified multi-scale deep convolutional neural network for fast object detection,” in Proceedings of European conference on computer vision, 2016.
- (19) P. Hu and D. Ramanan, “Finding tiny faces,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017.
- (20) Y. Wang, X. Ji, Z. Zhou, H. Wang, and Z. Li, “Detecting faces using region-based fully convolutional networks,” arXiv preprint arXiv:1709.05256, 2017.
- (21) J. Wang, Y. Yuan, and G. Yu, “Face attention network: an effective face detector for the occluded faces,” arXiv preprint arXiv:1711.07246, 2017.
- (22) X. Tang, D. K. Du, Z. He, and J. Liu, “Pyramidbox: A context-assisted single shot face detector,” in Proceedings of the European Conference on Computer Vision, 2018.