BAN: Focusing on Boundary Context for Object Detection

11/13/2018
by Yonghyun Kim, et al.
POSTECH

Visual context is an important clue for object detection, and the context around the boundaries of an object is especially valuable. We propose a boundary aware network (BAN) designed to exploit the visual contexts including boundary information and surroundings, named boundary contexts, and define three types of boundary context: side, vertex and in/out-boundary context. Our BAN consists of 10 sub-networks for the areas belonging to the boundary contexts. The detection head of BAN is defined as an ensemble of these sub-networks with different contributions depending on the sub-problem of detection. To verify our method, we visualize the activation of the sub-networks according to the boundary contexts and empirically show that the sub-networks contribute more to the related sub-problems in detection. We evaluate our method on the PASCAL VOC detection benchmark and the MS COCO dataset. The proposed method achieves a mean Average Precision (mAP) of 83.4 on PASCAL VOC and a COCO-style mAP of 36.9 on MS COCO. BAN allows the convolution network to provide an additional source of contexts for detection and to selectively focus on the more important contexts, and it can be generally applied to many other detection methods as well to enhance detection accuracy.


1 Introduction

Object detection is one of the core problems among computer vision tasks because of the extensiveness of its applicable areas, such as robotics, visual surveillance and autonomous safety. In recent years, there have been outstanding achievements in object detection by successfully deploying convolutional neural networks [12, 15, 16, 19, 21, 22, 23]. Despite this success, there is still a gap between current state-of-the-art performance and perfection, and many challenging problems remain unsolved.

Figure 1: Three advantages of boundary contexts: (1) The boundary contexts provide information that could otherwise be lost due to mis-aligned proposals, enabling more accurate classification and localization. (2) Depending on the sub-problem, the importance of each context may be weighted differently; the detector can localize more accurately by focusing on a specific area. (3) As nearby objects are included or excluded by the context, the relationship between the object of the proposal and nearby objects can be considered. For example, a person on a horse is a valid relationship, but a horse on a chair is an invalid relationship.

Visual context is a powerful clue for object detection, and the context around the boundaries of an object, such as its surroundings and shape, is especially valuable. Many advantages can be expected by exploiting boundary contexts in addition to a given proposal (Fig. 1). Detection frameworks search for objects across proposals generated by region proposal algorithms such as selective search [26], edge boxes [29] and the region proposal network [23]. However, mis-aligned proposals with large differences from the location and size of objects may cause difficulties in detection due to a lack of information. The boundary contexts can be an additional source of information for detection, and they allow the detector to selectively focus on the more important contexts depending on the sub-problem. By including or excluding the surrounding context, the network can focus on partial details of the object or consider the relationships between objects.

We propose a boundary aware network (BAN) designed to consider the boundary contexts and empirically prove its effectiveness. BAN efficiently represents the relationships among the boundary contexts by implementing the contexts as different sub-networks, and improves detection accuracy. We use a total of 10 boundary contexts drawn from three pre-defined types: side, vertex and in/out-boundary context. Our BAN consists of 10 corresponding sub-networks for the areas belonging to the boundary contexts. The detection head of BAN is defined as an ensemble of these sub-networks with different contributions depending on the sub-problem. We prove the validity of our method by visualizing the activation of BAN and measuring the contributions of BAN's sub-networks.

We conduct experiments on two different object detection datasets, as well as experiments on strategies for BAN such as the combination of boundary contexts, the feature resolution of the sub-networks and the sharing of features. The proposed BAN improves R-FCN [15] by 3.2 mean Average Precision (mAP) at a threshold of 0.5 IoU and Deformable R-FCN [4] by 1.2 mAP on PASCAL VOC [10], and improves R-FCN by 4.5 COCO-style mAP and Deformable R-FCN by 2.4 COCO-style mAP on MS COCO [18]. The experiments verify that BAN improves detection accuracy and that each boundary context has a distinct meaning for detection.

We make three main contributions:

  • We develop the boundary aware network to consider the boundary contexts around a given proposal and empirically study the influence of the boundary contexts on classification and bounding box regression. Our BAN detects objects more accurately by combining sub-networks whose importance differs according to the sub-problem of the detection head.

  • We empirically demonstrate the effectiveness of BAN for object detection. We visualize the activation of the sub-networks according to the boundary contexts and empirically show that each boundary context of BAN contributes more strongly to the detection head it is intuitively related to. These related contributions suggest that BAN carries distinct meanings beyond a naive ensemble of sub-networks.

  • BAN allows the convolution network to provide an additional source of contexts for detection and to selectively focus on the more important contexts, and it can be generally applied to many other detection methods to enhance detection accuracy.

This paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed BAN and shows its effectiveness. Section 4 presents experiments on two object detection datasets and on the strategies for BAN. Section 5 concludes.

2 Related Works

Classic Object Detectors.

The sliding-window paradigm, in which a classifier is applied on a dense image pyramid [1, 13], was used for a long time to localize objects of various sizes. Viola and Jones [27] used adaptive boosting with Haar features and a decision stump as a weak classifier, primarily for face detection. Dalal and Triggs [5] constructed a human detection framework with HOG descriptors and a support vector machine. Dollár et al. [9] developed integral channel features, which extract features from channels such as LUV and gradient histograms using integral images, combined with a boosted decision tree for pedestrian detection. They later expanded this to aggregated channel features and a feature pyramid [8] for a fast and accurate detection framework. Deformable part models (DPM) [11, 28] extend conventional detectors to more general object categories by modelling an object as a set of parts with spatial constraints. While sliding-window based approaches were mainstream for many years, advances in deep learning led the CNN-based detectors described next to dominate object detection.

Modern Object Detectors. The dominant paradigm in modern object detection is a two-stage approach that generates candidate proposals in the first stage and classifies the proposals into background and foreground classes in the second stage. The first-stage generators should provide higher recall and more efficiency than a sliding window, and they directly affect the detection accuracy of the second-stage classifiers. Representative region proposal approaches are selective search [26], edge boxes [29] and the region proposal network (RPN) [23]. As the representative two-stage frameworks, Fast and Faster R-CNN [12, 23] proposed the standard structure of CNN-based detection and showed good detection accuracy. These methods extract RoI-wise convolutional features by RoI pooling and classify the RoIs of the proposals into background and foreground classes using RoI-wise sub-networks. Region-based fully convolutional networks (R-FCN) [15] improved speed by designing the network to be fully convolutional and excluding RoI-wise sub-networks. However, the two-stage decision process limits practicality in terms of speed. One-stage detectors such as SSD [19] and YOLO [22] achieve practical performance by focusing on the speed/accuracy trade-off; they run at 30-100 FPS at the cost of 5-20% lower detection accuracy. We evaluate our BAN with R-FCN and show the improvement in detection accuracy.

Residual Network. The residual network [14], one of the most widely used backbone networks in recent years, was proposed to solve the problem that learning becomes difficult as the network becomes deeper. Against the expectation that stacking more layers increases accuracy with more capacity, deeper networks exhibited a degradation of both training and test accuracy. The degradation of training accuracy implies that the difficulty of optimizing deep structures, rather than over-fitting, causes the degradation. Residual learning prevents deeper networks from having a higher training error than shallower networks by adding shortcut connections that perform identity mapping. It is easier for a residual block to learn a residual of zero than to learn the desired mapping directly. By designing the desired mapping as a residual function, the residual block makes learning easier for deeper networks.
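To make the idea concrete, the following is a minimal PyTorch sketch of a residual block (our illustration, not the exact ResNet-101 bottleneck; channel counts and normalization placement are simplified):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x with an identity shortcut."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Learning the residual F(x) = H(x) - x is easier than learning H(x)
        # directly; the identity shortcut lets the block fall back to x.
        return self.relu(self.body(x) + x)
```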

Detection with Context. Context is an important clue in computer vision applications such as detection [6, 7], segmentation [2] and recognition [3]. Ding et al. [6] designed contextual cues in the spatial, scaling and color spaces and developed an iterative classification algorithm called contextual boost. AZ-Net [20] accurately localizes an object by dividing and detecting regions recursively. Because the divided regions differ considerably from the object area at first, it uses the inner and surrounding contexts to iteratively compensate for the imperfection of the regions. Deformable R-FCN [4], a generalization of atrous convolution, partially includes the effect of visual context by exploring the surroundings at the cell level. FPN [16] and RetinaNet [17] exploit contexts across scales by aggregating multi-scale convolutional blocks. These methods consider contextual cues in various ways; however, they exploit the visual context only partially. BAN provides distinct contexts for the surroundings more directly and can easily improve the performance of various detectors.

Figure 2: Overview of the proposed BAN with a classifier head for the object classes. Our detection architecture classifies and localizes an object from a proposal by integrating sub-networks representing different boundary contexts.

3 Boundary Aware Network

We propose a boundary aware network (BAN) to exploit the contexts of boundary information and surroundings, named boundary contexts, and define three types of boundary context: side, vertex and in/out-boundary context. Visual context [2, 3, 6, 7] is an important clue for object detection. Because most detection frameworks pool convolutional features only from the proposal area, it is difficult to directly consider areas not included exactly in the proposal or the relationship with the surroundings. The proposed BAN enhances detection accuracy by ensembling sub-networks that directly use the boundary contexts of the proposal as additional information.

Let $p$ be one of the proposals for a given image $I$, and let $\mathcal{B} = \{b_1, \dots, b_{10}\}$ be the set of boundary contexts, where $b_i = G_i(p)$ and $G_i$ denotes a generator that provides the boundary region related to $p$. The classifier and regressor $f$, defined as the aggregation of the detection $D_0$ for the original proposal and the detections $D_i$ of the corresponding sub-networks for each boundary context, take the following form:

$$f(p) = D_0(p) + \sum_{i=1}^{10} D_i(G_i(p)) \qquad (1)$$

Each $D_i$ is built empirically according to the pooling method, such as RoI pooling or PSRoI pooling. In the PSRoI pooling based implementation, each $D_i$ is a detection head and $f$ is a simple aggregation of the detection heads, defined in Eq. 1 as the sum of the baseline and the sub-networks of BAN. Thus, the propagated errors are transferred equally to each sub-network during back-propagation:

$$\frac{\partial L}{\partial D_i} = \frac{\partial L}{\partial f}\,\frac{\partial f}{\partial D_i} = \frac{\partial L}{\partial f} \qquad (2)$$

Because the error of the upper layer is propagated equally to each sub-network, the sub-networks are trained in a balanced manner, each weighting the importance of its context toward the same goal. In Section 3.3, we show that each sub-network of BAN indeed contributes more to its related sub-problem.
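To illustrate Eqs. 1 and 2, the following is a minimal PyTorch-style sketch of the head aggregation (our simplification: linear heads stand in for the PSRoI pooling detection heads, and the class name BANHead is hypothetical). Because the output is a plain sum, the gradient reaching each sub-network equals the gradient of the ensemble output:

```python
import torch
import torch.nn as nn

class BANHead(nn.Module):
    """Toy version of Eq. 1: the final prediction is a plain sum of the base
    head and the per-context heads, so gradients flow equally to each (Eq. 2)."""
    def __init__(self, feat_dim: int, num_outputs: int, num_contexts: int = 10):
        super().__init__()
        self.base = nn.Linear(feat_dim, num_outputs)
        self.context_heads = nn.ModuleList(
            nn.Linear(feat_dim, num_outputs) for _ in range(num_contexts)
        )

    def forward(self, base_feat, context_feats):
        # context_feats: list of pooled features, one per boundary-context sub-network
        out = self.base(base_feat)
        for head, feat in zip(self.context_heads, context_feats):
            out = out + head(feat)  # simple aggregation: d(out)/d(head_i) = 1
        return out
```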

Figure 3: Three types of boundary contexts: (1) Side contexts represent areas centered on each side of the proposal and imply the relationship with nearby objects and localization in the vertical and horizontal directions. (2) Vertex contexts represent areas centered on each vertex of the proposal and imply the relationship with nearby objects and localization in the diagonal directions. (3) In- and out-boundary contexts represent the inner and outer regions around the boundaries of the proposal and imply the inner detail of the object or the relationship with surrounding objects.

3.1 Architecture

We use a fully convolutional network that excludes the average pooling, 1000-d fully connected and softmax layers from ResNet-101 [14] as the backbone. Each sub-network in BAN takes a prediction map computed by stacking convolutions on the backbone network and uses PSRoI pooling [15] to calculate the objectness and bounding box of the given proposals. We employ 10 different sub-networks to deal with the boundary regions generated by the generators $G_i$ for the boundary contexts. BAN classifies and regresses the objectness and bounding box of the proposal through a detection head that is an ensemble of the predictions of 11 sub-networks, including a sub-network for the original proposal (Fig. 2). In the learning process, although the head is a simple aggregation, each sub-network is not constrained to equal importance; it learns a different magnitude of contribution according to the sub-problem, such as the classification of a person or the relative regression of width.

3.2 Boundary Context

We use a total of 10 boundary contexts from three different types of pre-defined boundary contexts: side, vertex and in/out-boundary context (Fig. 3). The RoIs for the side contexts are defined as regions having the same height and 2/3 of the width of the proposal, centered at the left and right sides, and regions having 2/3 of the height and the same width of the proposal, centered at the top and bottom sides. The RoIs for the vertex contexts are defined as regions having 2/3 of the height and width of the proposal, centered at each vertex. The RoIs for the in- and out-boundary contexts are defined as a half-size region and a double-size region, respectively, sharing the center point with the proposal.
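Under these definitions, the 10 context RoIs follow from simple box arithmetic. The sketch below is our illustration of Section 3.2 (clipping to image boundaries is omitted):

```python
def boundary_context_rois(x1, y1, x2, y2):
    """Generate the 10 boundary-context RoIs for a proposal (x1, y1, x2, y2):
    4 side contexts, 4 vertex contexts, and the in/out-boundary contexts."""
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0

    def centered(cx_, cy_, w_, h_):
        return (cx_ - w_ / 2, cy_ - h_ / 2, cx_ + w_ / 2, cy_ + h_ / 2)

    return {
        # Side contexts: 2/3 of the width (or height), centered on each side.
        "left":   centered(x1, cy, 2 * w / 3, h),
        "right":  centered(x2, cy, 2 * w / 3, h),
        "top":    centered(cx, y1, w, 2 * h / 3),
        "bottom": centered(cx, y2, w, 2 * h / 3),
        # Vertex contexts: 2/3 of both width and height, centered on each vertex.
        "top_left":     centered(x1, y1, 2 * w / 3, 2 * h / 3),
        "top_right":    centered(x2, y1, 2 * w / 3, 2 * h / 3),
        "bottom_left":  centered(x1, y2, 2 * w / 3, 2 * h / 3),
        "bottom_right": centered(x2, y2, 2 * w / 3, 2 * h / 3),
        # In/out-boundary contexts: half- and double-size boxes, same center.
        "in":  centered(cx, cy, w / 2, h / 2),
        "out": centered(cx, cy, 2 * w, 2 * h),
    }
```

For a 100×100 proposal, for example, the out-boundary context is the 200×200 box sharing the same center.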

Figure 4: Illustration of BAN for two related objects (person and horse) in a given image. We visualize BAN with Contribution and Local Activation to show its effectiveness more directly.
Class    Base   ↑      ↓      ←      →      ↖      ↗      ↙      ↘      In     Out
bkgd     0.070  0.138  0.056  0.061  0.059  0.038  0.034  0.042  0.035  0.328  0.140
aero     0.069  0.112  0.082  0.103  0.101  0.056  0.062  0.047  0.075  0.137  0.156
bike     0.110  0.157  0.112  0.082  0.071  0.043  0.054  0.048  0.060  0.207  0.056
bird     0.131  0.121  0.094  0.101  0.095  0.062  0.054  0.060  0.055  0.172  0.054
boat     0.105  0.130  0.086  0.098  0.097  0.064  0.045  0.062  0.064  0.116  0.133
bottle   0.124  0.093  0.083  0.076  0.106  0.051  0.048  0.055  0.049  0.227  0.089
bus      0.074  0.127  0.084  0.089  0.078  0.054  0.053  0.053  0.059  0.209  0.121
car      0.095  0.128  0.081  0.099  0.111  0.063  0.071  0.056  0.075  0.157  0.063
cat      0.187  0.119  0.076  0.072  0.081  0.056  0.042  0.047  0.053  0.209  0.058
chair    0.112  0.120  0.115  0.104  0.105  0.058  0.057  0.055  0.060  0.144  0.070
cow      0.079  0.163  0.077  0.084  0.093  0.058  0.043  0.048  0.044  0.255  0.056
table    0.093  0.108  0.103  0.093  0.099  0.060  0.057  0.053  0.053  0.222  0.058
dog      0.088  0.108  0.088  0.079  0.085  0.060  0.045  0.049  0.057  0.233  0.107
horse    0.085  0.137  0.072  0.082  0.074  0.058  0.039  0.054  0.047  0.256  0.097
mbike    0.124  0.115  0.072  0.088  0.092  0.051  0.059  0.048  0.066  0.225  0.060
person   0.138  0.148  0.080  0.114  0.117  0.061  0.066  0.062  0.063  0.105  0.046
plant    0.096  0.149  0.092  0.095  0.093  0.059  0.067  0.050  0.073  0.148  0.078
sheep    0.109  0.142  0.112  0.107  0.118  0.065  0.053  0.051  0.059  0.130  0.054
sofa     0.107  0.144  0.080  0.088  0.086  0.068  0.040  0.066  0.046  0.154  0.120
train    0.075  0.097  0.083  0.090  0.084  0.058  0.060  0.055  0.069  0.141  0.190
tv       0.100  0.134  0.116  0.104  0.106  0.063  0.060  0.058  0.057  0.111  0.091
Table 1: Contribution of BAN's sub-networks to classification on PASCAL VOC. Base denotes the sub-network for the original proposal, each arrow denotes the side or vertex context located in the corresponding direction, and In and Out denote the in/out-boundary contexts.
Target   Base   ↑      ↓      ←      →      ↖      ↗      ↙      ↘      In     Out
t_x      0.181  0.088  0.077  0.118  0.137  0.058  0.062  0.053  0.058  0.077  0.090
t_y      0.094  0.123  0.096  0.046  0.046  0.037  0.041  0.033  0.045  0.055  0.384
t_w      0.118  0.065  0.071  0.123  0.132  0.068  0.070  0.059  0.089  0.100  0.104
t_h      0.089  0.131  0.105  0.051  0.044  0.038  0.042  0.034  0.044  0.212  0.209
Table 2: Contribution of BAN's sub-networks to localization on PASCAL VOC.

3.3 Visualization of BAN

We visualize the responses of the feature maps, which are activated on the areas close to the related object (Fig. 4), to show the effectiveness of BAN. Contribution shows that BAN is weighted more strongly toward the related instance than toward the background, and Local Activation shows that each context is activated close to its target. We also measured the classification contributions of BAN's sub-networks (Table 1). The contributions are almost uniformly distributed due to the large variations among objects, but the boundary contexts of ↑ and In, which can include representative parts such as the head and inner detail, show a slightly larger contribution. The localization contributions demonstrate that BAN faithfully considers the boundary contexts (Table 2). The regressions in the vertical direction, $t_y$ and $t_h$, receive high contributions from ↑ and ↓. The in/out-boundary contexts do not show a specific tendency but contribute strongly; we infer that the redundancy of the regions for Base, In and Out makes them play similar roles. We construct both the visualization and the contributions using the PSRoI pooling based BAN for intuitive comparison.

4 Experiments

We conduct experiments on two different object detection datasets, as well as experiments on strategies for BAN such as the combination of boundary contexts, the feature resolution of the sub-networks and the sharing of features. Our BAN improves R-FCN [15] by 3.2 mAP at a threshold of 0.5 IoU and Deformable R-FCN [4] by 1.2 mAP on PASCAL VOC [10], and improves R-FCN by 4.5 COCO-style mAP and Deformable R-FCN by 2.4 COCO-style mAP on MS COCO [18]. The experiments show that BAN improves detection accuracy and imply that each boundary context has a distinct meaning for detection.

Figure 5: Detailed structure of BAN for a classifier head (a regression head is defined similarly). The structure of BAN is determined empirically according to the pooling method (PSRoI pooling or RoI pooling).

4.1 Implementation

Baseline. We use a fully convolutional network [15] that excludes the average pooling, 1000-d fully connected and softmax layers from ResNet-101 [14]. The last convolution block, res5, in ResNet-101 has a stride of 32 pixels. Many detection and segmentation methods employ a modified ResNet-101 that increases the receptive field by changing the stride of res5 from 2 to 1; to compensate for this modification, the dilation of the convolutions in the last block is changed from 1 to 2. The last convolution block res5 in the modified ResNet-101 thus has a stride of 16 pixels, and we use this as the backbone. We fine-tune the model from the ResNet-101 model pre-trained on ImageNet [24].

Structure. BAN can be implemented with any pooling method, such as RoI pooling or PSRoI pooling (Fig. 5). We determine the structure of BAN empirically according to each pooling method. BAN with PSRoI pooling integrates the sub-networks, which are detection heads, by aggregating them. BAN with RoI pooling extracts 256-d convolutional features from the sub-networks and builds a single detection head on the concatenated features. Both structures improve detection accuracy. However, the former makes the contributions of the contexts easy to analyze, because all detectors, including the baseline, are structurally identical, while the latter yields a higher improvement in accuracy, because it generates more distinct features for R-FCN based detectors.
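The two integration strategies can be contrasted in a short sketch. Below is our hypothetical illustration of the RoI pooling variant, where the 256-d features from the 11 regions (base plus 10 contexts) are concatenated into a single detection head; the class name and shapes are assumptions:

```python
import torch
import torch.nn as nn

class BANConcatHead(nn.Module):
    """RoI pooling variant: 256-d features from each of the 11 regions are
    concatenated and fed to one detection head (illustrative shapes)."""
    def __init__(self, num_regions: int = 11, feat_dim: int = 256, num_outputs: int = 21):
        super().__init__()
        self.head = nn.Linear(num_regions * feat_dim, num_outputs)

    def forward(self, region_feats):
        # region_feats: list of [N, 256] tensors, one per region (base + 10 contexts)
        return self.head(torch.cat(region_feats, dim=1))
```

In contrast, the PSRoI pooling variant simply sums per-region detection heads, as in the BANHead sketch of Section 3.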

Learning. We use stochastic gradient descent (SGD) with momentum and weight decay. We train the network for 29k iterations on PASCAL VOC, dividing the learning rate by 10 at 20k iterations, and for 240k iterations on MS COCO, dividing it by 10 at 160k iterations. A mini-batch consists of 2 images, each resized so that its shorter side is 600 pixels. In training, online hard example mining (OHEM) [25] selects 128 hard examples from 300 RoIs per image; OHEM evaluates the multi-task loss of all proposals and then discards the proposals with small loss, making the detector focus on difficult samples. The detection network is trained with 4 synchronized GPUs, each holding 2 images. We use 300 RoIs per image, obtained from the RPN and post-processed by non-maximum suppression (NMS) with a threshold of 0.3 IoU, for both learning and inference.

Loss function. The loss function is defined as the sum of a classification loss and a box regression loss, $L = L_{cls} + L_{reg}$. The classification loss is a cross-entropy loss, $L_{cls}(p, u) = -\log p_u$, where $p = (p_0, \dots, p_C)$ is a discrete probability distribution over $C + 1$ categories and $u$ is the ground-truth class. The regression loss is a smooth $L_1$ loss [12], $L_{reg}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i)$, where $t^u$ is the tuple of bounding-box regression offsets for each of the classes, indexed by the class $u$, and $v$ is the ground-truth bounding-box regression tuple.

Cost Analysis. We analyze the inference time and the memory consumption (Table 3). The analysis is performed using ResNet-101 and the RoI pooling based BAN. BAN improves various detection methods with a reasonable increase in memory and computation time.

Table 3: Cost Analysis

Method                  Inference Time   Memory Consumption
R-FCN [15]              70 ms            0.8 GB
R-FCN-BAN               97 ms            1.2 GB
Deformable R-FCN [4]    96 ms            0.9 GB
Deformable R-FCN-BAN    120 ms           1.2 GB

Table 4: Boundary Context

Contexts                     mAP@0.5   mAP@0.7
None                         79.54     61.95
Sides                        80.23     62.84
Vertices                     80.01     62.13
In/Out-boundary              79.80     63.23
Sides + Vertices             80.39     63.36
Sides + Vertices + In/Out    80.75     64.66

Table 5: Feature Resolution

Resolution   mAP@0.5   mAP@0.7
-            79.39     61.36
-            80.15     63.45
-            80.75     64.66
-            80.10     63.76

Table 6: Feature Sharing

Setting      mAP@0.5   mAP@0.7
Unshared     80.05     62.80
Shared       80.75     64.66

Table 7: Pooling Method

Pooling         mAP@0.5   mAP@0.7
PSRoI Pooling   80.75     64.66
RoI Pooling     82.72     67.84

4.2 Comparison with Strategies for BAN

We experiment with strategies for BAN, such as different combinations of boundary contexts, feature resolutions of the sub-networks, feature sharing and the pooling method, to construct an effective BAN. The experiments are performed using ResNet-101 and the PSRoI pooling based BAN on PASCAL VOC.

Boundary Context. We conduct experiments on the types of boundary contexts (side, vertex and in/out-boundary contexts) and their combinations (Table 4). Every type of boundary context yields a meaningful improvement in detection accuracy, and the combination of all three types improves mAP@0.5 by 1.21 and mAP@0.7 by 2.71. This experiment shows that each boundary context has a distinct meaning for detection.

Feature Resolution. We conduct experiments on the feature resolution of the sub-networks (Table 5). An intermediate feature resolution yields the highest improvement; a resolution that is too low degrades detection accuracy because it crushes the boundary contexts.

Feature Sharing. Each sub-network consists of a dimensional

convolution and the following relu for feature extraction and a

dimensional convolution as classification and a dimensional convolution as regression for detection heads (Table 7). The different use of convolution for feature extraction lead to the improvement of 0.73 point in mAP. This experiment implies that the boundary context transfers a distinctive influence to the feature level as well as the detection head in learning.

Pooling. The implementation of BAN differs slightly depending on the pooling method used to extract the visual context. We conduct experiments on two pooling methods: RoI pooling and PSRoI pooling (Table 7). PSRoI pooling requires few resources because it is fully convolutional; RoI pooling improves detection accuracy more, because it easily extracts the fundamental convolutional features for the boundary contexts.
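For reference, both pooling operators are available in torchvision.ops; the snippet below illustrates their differing input requirements (shapes are illustrative, assuming a stride-16 feature map):

```python
import torch
from torchvision.ops import roi_pool, ps_roi_pool

# (batch_idx, x1, y1, x2, y2) in image coordinates
rois = torch.tensor([[0, 32.0, 32.0, 96.0, 96.0]])

# PSRoI pooling: position-sensitive and fully convolutional; the input must
# carry k*k score maps per class (here k=3 and 21 classes, so 9*21 channels).
score_maps = torch.randn(1, 9 * 21, 40, 40)
ps = ps_roi_pool(score_maps, rois, output_size=3, spatial_scale=1.0 / 16)  # [1, 21, 3, 3]

# RoI pooling: plain max pooling over the RoI grid; the pooled features feed
# a separate detection head.
feat = torch.randn(1, 256, 40, 40)
rp = roi_pool(feat, rois, output_size=7, spatial_scale=1.0 / 16)  # [1, 256, 7, 7]
```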

Method                  | VOC 2007           | MS COCO test-dev
                        | mAP@0.5   mAP@0.7  | mAP    AP@0.5   AP_S   AP_M   AP_L
Faster R-CNN [23]       | 76.4      -        | 30.3   52.1     9.9    32.2   47.4
YOLOv2 [22]             | 79.5      -        | 21.6   44.0     5.0    22.4   35.5
SSD513 [19]             | 76.8      -        | 31.2   50.4     10.2   34.5   49.8
R-FCN [15]              | 79.5      62.0     | 29.9   50.8     11.0   32.2   43.9
R-FCN-BAN               | 82.7      67.8     | 34.4   58.5     17.8   37.7   46.0
Deformable R-FCN [4]    | 82.2      67.6     | 34.5   55.0     14.0   37.7   50.3
Deformable R-FCN-BAN    | 83.4      70.0     | 36.9   58.5     15.8   40.0   53.6
Table 8: Evaluation on PASCAL VOC 2007 and MS COCO test-dev.

4.3 Experiments on PASCAL VOC

We evaluate the proposed BAN on PASCAL VOC [10], which has 20 object categories (Fig. 6). We train the models on the union of the VOC 2007 and VOC 2012 trainval sets, 07+12 (16,551 images), and evaluate on the VOC 2007 test set (4,952 images). Detection accuracy is measured by mean Average Precision (mAP). BAN improves on R-FCN [15] by 3.2 mAP at a threshold of 0.5 IoU and by 5.8 mAP at 0.7 IoU, and on Deformable R-FCN [4] by 1.2 mAP at 0.5 IoU and by 2.4 mAP at 0.7 IoU (Tables 8 and 9).

4.4 Experiments on MS COCO

We evaluate the proposed BAN on the MS COCO dataset [18], which has 80 object categories. We train the models on the union of the 80k training set and the 40k validation set (trainval), and evaluate on the 20k test-dev set. The COCO-style metric denotes mAP averaged over IoU thresholds from 0.5 to 0.95 with an interval of 0.05. Our BAN improves on R-FCN [15] by 4.5 COCO-style mAP and by 7.7 mAP at a threshold of 0.5 IoU, and on Deformable R-FCN [4] by 2.4 COCO-style mAP and by 3.5 mAP at 0.5 IoU (Table 8). We obtain a higher improvement in detection accuracy on MS COCO, with its varied classes and challenging environments, than on PASCAL VOC.

5 Conclusions

We propose a boundary aware network (BAN) designed to exploit the boundary contexts, and we empirically study their influence on classification and bounding box regression. To show the effectiveness of BAN, we visualize the activation of the sub-networks according to the boundary contexts and empirically show that each boundary context contributes more strongly to the detection head it is intuitively related to. These related contributions suggest that BAN carries distinct meanings beyond a naive ensemble of sub-networks. We evaluate our method on the PASCAL VOC detection benchmark, which has 20 object categories, and the MS COCO dataset, which has 80 object categories. Our BAN improves mAP by 3.2 points over R-FCN and by 1.2 points over Deformable R-FCN on PASCAL VOC, and improves COCO-style mAP by 4.5 points over R-FCN and by 2.4 points over Deformable R-FCN on MS COCO. BAN allows the convolution network to provide an additional source of contexts for detection and to selectively focus on the more important contexts, and it can be generally applied to many other detection methods to enhance detection accuracy.

As a future study, we will improve detection accuracy by applying BAN to the entire network, including the RPN. In addition, we plan to develop a general version of BAN based on this study of the influence of, and relationships among, the boundary contexts.

Acknowledgments

This work was supported by IITP grants funded by the Korea government (MSIT): IITP-2014-3-00059, Development of Predictive Visual Intelligence Technology; IITP-2017-0-00897, SW Starlab support program; and IITP-2018-0-01290, Development of Open Informal Dataset and Dynamic Object Recognition Technology Affecting Autonomous Driving.

Figure 6: Examples of object detection results on PASCAL VOC 2007 test set using our method (83.4% mAP). The network is based on ResNet-101 and the training data is 07+12 trainval.

Method              mAP   aero  bike  bird  boat  bottle bus   car   cat   chair cow   table dog   horse mbike person plant sheep sofa  train tv
Faster R-CNN [23]   76.4  79.8  80.7  76.2  68.3  55.9   85.1  85.3  89.8  56.7  87.8  69.4  88.3  88.9  80.9  78.4   41.7  78.6  79.8  85.3  72.0
R-FCN [15]          79.5  82.5  83.7  80.3  69.0  69.2   87.5  88.4  88.4  65.4  87.3  72.1  87.9  88.3  81.3  79.8   54.1  79.6  78.8  87.1  79.5
R-FCN-BAN           82.7  89.1  88.4  80.7  76.9  73.3   89.6  88.8  89.5  69.5  88.0  74.5  90.0  89.3  86.8  80.5   57.6  84.3  84.7  88.5  84.5
DR-FCN [4]          82.2  85.9  89.3  80.7  74.8  72.4   88.2  88.8  89.5  69.0  88.2  75.4  89.7  89.4  84.5  83.4   57.3  84.9  82.3  87.6  82.7
DR-FCN-BAN          83.4  88.0  89.5  80.6  77.0  73.4   88.8  89.0  89.8  70.7  88.4  77.3  90.2  89.4  87.5  84.6   58.2  85.6  85.3  88.2  85.9
Table 9: Detailed detection results on PASCAL VOC 2007 test set.

References

  • [1] Adelson, E.H., Anderson, C.H., Bergen, J.R., Burt, P.J., Ogden, J.M.: Pyramid methods in image processing. RCA engineer (1984)
  • [2] Avidan, S.: Spatialboost: Adding spatial reasoning to adaboost. In: European Conference on Computer Vision (ECCV) (2006)
  • [3] Carbonetto, P., De Freitas, N., Barnard, K.: A statistical model for general contextual object recognition. In: European Conference on Computer Vision (ECCV) (2004)
  • [4] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: IEEE International Conference on Computer Vision (ICCV) (2017)
  • [5] Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2005)
  • [6] Ding, Y., Xiao, J.: Contextual boost for pedestrian detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
  • [7] Divvala, S.K., Hoiem, D., Hays, J.H., Efros, A.A., Hebert, M.: An empirical study of context in object detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2009)
  • [8] Dollár, P., Appel, R., Belongie, S., Perona, P.: Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2014)
  • [9] Dollár, P., Tu, Z., Perona, P., Belongie, S.: Integral channel features. In: British Machine Vision Conference (BMVC) (2009)
  • [10] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International Journal of Computer Vision (IJCV) (2010)
  • [11] Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2010)
  • [12] Girshick, R.: Fast r-cnn. In: IEEE International Conference on Computer Vision (ICCV) (2015)
  • [13] Gonzalez, R.C.: Digital image processing. Pearson Education India (2009)
  • [14] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • [15] Li, Y., He, K., Sun, J., et al.: R-fcn: Object detection via region-based fully convolutional networks. In: Advances in Neural Information Processing Systems (NIPS) (2016)
  • [16] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [17] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: IEEE International Conference on Computer Vision (ICCV) (2017)
  • [18] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European Conference on Computer Vision (ECCV) (2014)
  • [19] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: European Conference on Computer Vision (ECCV) (2016)
  • [20] Lu, Y., Javidi, T., Lazebnik, S.: Adaptive object detection using adjacency and zoom prediction. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • [21] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • [22] Redmon, J., Farhadi, A.: Yolo9000: Better, faster, stronger. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [23] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (NIPS) (2015)
  • [24] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) (2015)
  • [25] Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • [26] Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. International Journal of Computer Vision (IJCV) (2013)
  • [27] Viola, P., Jones, M.J.: Robust real-time face detection. International Journal of Computer Vision (IJCV) (2004)
  • [28] Yang, Y., Ramanan, D.: Articulated human detection with flexible mixtures of parts. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2013)
  • [29] Zitnick, C.L., Dollár, P.: Edge boxes: Locating object proposals from edges. In: European Conference on Computer Vision (ECCV) (2014)