Object Detection with Mask-based Feature Encoding

02/12/2018
by   Xiaochuan Fan, et al.
University of South Carolina
IEEE

Region-based Convolutional Neural Networks (R-CNNs) have achieved great success in the field of object detection. Existing R-CNNs usually divide a Region-of-Interest (ROI) into grids, and then localize objects by utilizing the spatial information reflected by the relative position of each grid in the ROI. In this paper, we propose a novel feature-encoding approach, where spatial information is represented through the spatial distributions of visual patterns. In particular, we design a Mask Weight Network (MWN) to learn a set of masks and then apply channel-wise masking operations to the ROI feature map, followed by a global pooling and a cheap fully-connected layer. We integrate the newly designed feature encoder into the Faster R-CNN architecture. The resulting new Faster R-CNNs can preserve the object-detection accuracy of the standard Faster R-CNNs while using substantially fewer parameters. Compared to R-FCNs using state-of-the-art PS ROI pooling and deformable PS ROI pooling, the new Faster R-CNNs can produce higher object-detection accuracy with good run-time efficiency. We also show that a specifically designed and learned MWN can capture global contextual information and further improve the object-detection accuracy. Validation experiments are conducted on both PASCAL VOC and MS COCO datasets.


1 Introduction

Region-based convolutional neural networks (R-CNNs) have been recognized as one of the most effective tools for object detection [10, 9, 27, 20, 4, 5, 11]. One important component in R-CNNs is the region-wise feature encoder. In the standard Fast/Faster R-CNN [9, 27] architectures, this component is implemented by Region-of-Interest (ROI) pooling followed by fully-connected (FC) layers, as shown in Fig. 1(a). Such a feature encoder usually introduces a huge number of connections and parameters. Recently, very deep networks, such as ResNet (Residual Network) [12] and SENet (Squeeze-and-Excitation Network) [14], have been proposed for image classification, without the inclusion of large FC layers. However, it is non-trivial to use these networks to improve object-detection performance due to their lack of translation variance [4]. Special architectures have to be used in the feature encoder to encode spatial information into these networks for object detection [4, 5], e.g., PS (Position-Sensitive) ROI pooling in R-FCN (Region-based Fully Convolutional Network), as shown in Fig. 1(b). To the best of our knowledge, all these architectures utilize grids to represent object parts and reflect the spatial information in the ROI.

Figure 1: Feature-encoding comparison between our approach and Faster R-CNN/R-FCN. (a) Faster R-CNN: ROI pooling followed by large FC layers. (b) R-FCN: PS ROI pooling followed by voting. (c) Ours: ROI pooling followed by channel-wise masking, global pooling and a small FC layer.

In this paper, a novel feature-encoding approach is presented for object detection. The basic idea behind our approach is that, compared with grids, it is more natural to use middle-level visual patterns to represent object parts. Given that each channel of a CNN feature map is expected to be an activation map for a specific visual pattern, e.g., a middle-level attribute or object part [32, 33], we propose to learn a set of masks that reflect the spatial distribution associated with these visual patterns, and then apply channel-wise masking operations to the ROI feature map, followed by a global pooling and a cheap fully-connected layer. This architecture is illustrated in Fig. 1(c), where, as an example, a feature channel that is strongly activated for the human head is masked by a learned mask. We design a Mask Weight Network (MWN) for mask learning, and the MWN is jointly trained with the whole object-detection network.

To validate the effectiveness and efficiency of the proposed feature encoder, we integrate it into the Faster R-CNN architecture and find that the resulting new Faster R-CNNs, named MWN-based Faster R-CNN (M-FRCN) in this paper, perform very well in terms of object-detection accuracy, model complexity, and run-time efficiency. More specifically, compared with the standard Faster R-CNN with large FC layers, our M-FRCN preserves the object-detection accuracy while using substantially fewer parameters. Compared with R-FCNs using state-of-the-art PS ROI pooling [4] and deformable PS ROI pooling [5], our M-FRCN can produce higher object-detection accuracy without losing time efficiency. We also show that a specifically designed and learned MWN can capture global contextual information. The combination of two MWNs, one for local regions and the other for global context, leads to new Faster R-CNNs that can further improve the object-detection accuracy. We conduct the validation experiments on both PASCAL VOC and MS COCO datasets.

Figure 2: Architecture of the proposed MWN-based Faster R-CNN (M-FRCN).

2 Related Work

Girshick et al. [10] first proposed R-CNNs by evaluating CNNs on region proposals for bounding-box object detection. Girshick [9] extends R-CNN to Fast R-CNN by applying ROI pooling to enable end-to-end detector training on shared convolutional features. Ren et al. [27] further extend Fast R-CNN to Faster R-CNN by incorporating a Region Proposal Network (RPN). In the standard Fast/Faster R-CNN, the ROI pooling layer is followed by large fully-connected layers.

Recently, very deep feature-extraction architectures have been proposed [19, 12, 30, 16, 13, 15, 14], such as GoogLeNet, ResNet, SENet, and DenseNet, which no longer need large fully-connected layers. Although these very deep architectures can achieve impressive image-classification accuracy, it is non-trivial to directly use them to improve object detection due to their lack of spatial information encoding, as discussed in [4]. To address this issue, He et al. [12] insert the ROI pooling layer into convolutions to introduce translation variance. Dai et al. [4] propose R-FCN by adding a position-sensitive ROI pooling. In DCN (Deformable Convolutional Networks) [5], a deformable ROI pooling is proposed to add a learned offset to each ROI grid position, which enables adaptive part localization. In this paper, we propose a new feature-encoding approach for object detection. In the experiments, we show that this new feature encoder can be used for both shallow networks, e.g., VGG_CNN_M_1024 [2], and very deep networks, e.g., ResNet-101 [12], to improve object detection in terms of detection accuracy, number of parameters, and/or run-time efficiency.

Masks have been widely used in a variety of visual tasks, such as object detection [8, 18, 11], semantic segmentation [11, 3, 25], and pose estimation [11, 7]. Gidaris and Komodakis [8] and Kantorov et al. [18] use a set of manually designed masks to represent specific spatial patterns. [25] and [11] propose models to estimate segmentation masks. In [3], binary masks are applied to the convolutional feature maps to mask out the background region. In this paper, we learn masks to capture spatial information for object detection.

Global contextual information has been proven to be very valuable for object detection and recognition [31, 26, 24, 23]. Several techniques have been developed to incorporate global context into CNNs. For example, ParseNet [22] concatenates features from the full image to each element in a feature map. ResNet [12] performs a global ROI pooling to obtain global features and concatenates global features to local region features. ION [1] uses a stacked spatial RNN to exploit global context. In ParseNet and ResNet, all local features share the same global feature. Different from these networks, in this paper we propose a method to extract ROI-specific global features. Unlike ION, the proposed method uses a convolution-based solution.

3 Our Approach

In this section, we first introduce the general architecture of the proposed MWN(Mask Weight Network)-based Faster R-CNN (M-FRCN). Then, we describe the new MWN to learn masks for local ROIs and global context, leading to MWN-l and MWN-g, respectively. For convenience, the proposed M-FRCN using MWN-l and MWN-g are abbreviated as M-FRCN-l and M-FRCN-g, respectively. Finally, we combine M-FRCN-l and M-FRCN-g into M-FRCN-lg by integrating local region and global contextual information for object detection. While in this paper we incorporate the proposed feature encoders into the Faster R-CNN, it is easy to incorporate them into other R-CNN frameworks.

3.1 General Architecture of M-FRCN

As illustrated by Fig. 2, for each ROI proposed by a Region Proposal Network (RPN) [27], we take the following steps to encode its feature map for classification and bounding-box regression. First, an initial ROI (or image) pooling converts an ROI (or image) to a fixed-size $N \times N \times C$ feature map $X$, where $C$ is the number of channels of $X$ and is determined by the CNN backbone, e.g., a ResNet-101 model [12]. In the following sections, we will elaborate on the selection of the initial-pooling input, which can be either the full image or an ROI on the image. Different from the standard Faster R-CNN, we use an average ROI pooling instead of a max one.

Second, a Mask Weight Network (MWN) takes a raw mask $M^{raw}$ as input and performs a set of convolution operations on this raw mask to get a set of new masks $M_c$, $c = 1, \dots, C$, using $N \times N$ kernels and a stride equal to $1$. We will elaborate on the selection of the raw mask and the MWN in the following two sections.

Third, we apply the $c$-th new mask $M_c$ to $X_c$, i.e., the $c$-th channel of $X$, by

$$Y_c(i, j) = M_c(i, j) \cdot X_c(i, j) \qquad (1)$$

where $i$ and $j$ are the horizontal and vertical positions in a mask or feature map, and $Y$ denotes the ROI feature map after masking.

Finally, a global max pooling (GMP) is performed on each channel of $Y$, followed by a fully-connected (FC) layer. We denote the number of output nodes of this FC layer as $F$, and thus the number of connections of the FC layer is $C \times F$. Note that this new architecture does not incur large FC layers as the standard Faster R-CNN does, where ROI pooling is followed by two FC layers. Taking the VGG-16 architecture as an example, the FC layer of the proposed M-FRCN has only $C \times F$ connections, whereas an FC layer with $F$ output nodes that directly follows an $N \times N$ ROI pooling, as in the standard Faster R-CNN, has $N \times N \times C \times F$ connections.
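The encoding steps of this section (channel-wise masking, global max pooling, and a small FC layer) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' Caffe implementation: the function names, the nested-list tensor layout, and the example sizes in the usage note are assumptions made here for readability.

```python
def mask_and_pool(feature, masks):
    """Channel-wise masking followed by global max pooling (GMP).

    feature: list of C channels, each an N x N nested list of numbers.
    masks:   list of C masks with the same shape as the channels.
    Returns a list with one pooled value per channel.
    """
    pooled = []
    for channel, mask in zip(feature, masks):
        # Element-wise product of a channel with its own mask.
        masked = [[x * m for x, m in zip(row_x, row_m)]
                  for row_x, row_m in zip(channel, mask)]
        # GMP collapses the N x N masked channel to a single number.
        pooled.append(max(max(row) for row in masked))
    return pooled


def fully_connected(vec, weights, bias):
    """Plain FC layer: weights holds one row of len(vec) values per output node."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def fc_connections_after_gmp(num_channels, num_outputs):
    """Connections of an FC layer fed by one pooled value per channel."""
    return num_channels * num_outputs


def fc_connections_after_grid_pooling(grid, num_channels, num_outputs):
    """Connections of an FC layer fed by a flattened grid x grid x channels map."""
    return grid * grid * num_channels * num_outputs
```

As a usage example with hypothetical sizes (512 channels, 1024 output nodes, a 7 x 7 grid, chosen only for illustration), `fc_connections_after_gmp(512, 1024)` gives 524,288 connections, while `fc_connections_after_grid_pooling(7, 512, 1024)` gives 25,690,112, a factor of 49 more.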

3.2 M-FRCN-l

Following the general architecture of M-FRCN in Fig. 2, in this section we introduce M-FRCN-l by selecting a raw mask $M^{raw}$ and a corresponding MWN-l that can learn new masks $M_c$, $c = 1, \dots, C$, for local ROIs, where $C$ is the number of channels of the ROI feature map. As shown in Fig. 3(a), for M-FRCN-l, the input of the initial pooling is a considered ROI proposed by the RPN. As highlighted in the yellow region of Fig. 3(a), we simply select $M^{raw}$ to be a unary mask whose entries all take a preset constant value $v_l$. MWN-l consists of a convolutional layer with $C$ kernels, which transforms $M^{raw}$ to the new masks $M_c$, $c = 1, \dots, C$. In our approach, we always select the convolution kernel size equal to the mask size $N \times N$, and the raw mask $M^{raw}$ is padded with zeros in the convolution. Fig. 3(b) presents an example of the zero-padded mask, where the mask and the convolution kernel have the same size. Note that the zero-padded mask becomes a binary mask since $v_l \neq 0$.

A unary raw mask with preset constant entries seems to have no meaningful information to exploit by convolution. However, convolution at each mask entry actually involves this entry and its spatial neighbors. With zero-padding, each mask entry shows a different spatial pattern when its neighbors are considered, as shown in Fig. 3(c). These patterns reflect the relative position of each entry in the ROI. Moreover, during training, the network loss of classification and regression can be propagated backwards to MWN-l to learn the parameters in its convolution layer. In general CNNs, each channel of a CNN feature map is expected to be an activation map for a specific visual pattern [32, 33], e.g., a middle-level attribute or object part. Similarly, the new masks output by MWN-l, one per feature channel, can reflect the spatial distribution of the visual pattern associated with each channel of the ROI feature map, i.e., the convolution result on each entry of the raw mask can represent the likelihood that a visual pattern appears at the position of that entry, according to the training set. By applying each new mask to its corresponding feature channel, we expect that the proposed M-FRCN can encode the spatial information necessary for accurate object detection. In the inference stage, the learned mask is independent of the input image and can be considered constant, so the convolution in MWN-l can be skipped.
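The claim that a constant mask still carries positional information under zero-padded convolution can be checked with a small sketch. The code below assumes a stride of 1, an odd kernel size, and "same" zero padding; the helper name `conv_same_zero_pad` and the toy 3 x 3 kernel are illustrative assumptions, not values from the paper.

```python
def conv_same_zero_pad(mask, kernel):
    """Stride-1 2-D cross-correlation with zero padding (same-size output).

    mask:   n x n nested list (the raw mask).
    kernel: k x k nested list with odd k (one convolution kernel).
    """
    n = len(mask)
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    # Positions outside the mask are zero padding and add nothing.
                    if 0 <= y < n and 0 <= x < n:
                        acc += kernel[di][dj] * mask[y][x]
            out[i][j] = acc
    return out


# A unary raw mask (all entries equal) and one toy kernel of the same size.
raw_mask = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
new_mask = conv_same_zero_pad(raw_mask, kernel)
# new_mask == [[28.0, 39.0, 24.0], [33.0, 45.0, 27.0], [16.0, 21.0, 12.0]]
```

Because every position overlaps a different part of the kernel with the non-padded region, the nine outputs are all different even though the input is constant. With a kernel as large as the mask, each output entry can effectively be set independently by the learned weights.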

Figure 3: (a) An illustration of masking operation in M-FRCN-l with . (b) The zero-padded raw mask for MWN-l convolution. (c) Each mask entry shows different pattern when considering spatial neighbors for () convolution filtering.
Figure 4: (a) An illustration of masking operation in M-FRCN-g with . (b) A zero-padded raw mask for the MWN-g convolution. (c) Each mask entry shows different pattern when considering spatial neighbors for ( ) convolution filtering. These patterns reflect the relative position of each entry not only in the full image, but also to the ROI.

3.3 M-FRCN-g

Following the general architecture of M-FRCN in Fig. 2, in this section we introduce M-FRCN-g by selecting a raw mask $M^{raw}$ and a corresponding MWN-g that can learn new masks $M_c$, $c = 1, \dots, C$, for global context, where $C$ is the number of feature channels. As shown in Fig. 4(a), for M-FRCN-g, the input of the initial pooling is the full image, which provides the global contextual information. We then select the raw mask $M^{raw}$ to reflect the relative position of a considered ROI in the image.

More specifically, for the considered ROI, we construct a binary context map of the same size as the image. We set an entry of this context map to $v_{in}$ if it is located inside the considered ROI, and to $v_{out}$ otherwise. We then down-sample this context map to the size of $N \times N$, the scale of the initial pooling, and take it as the raw mask $M^{raw}$ of the considered ROI, as highlighted in the yellow region of Fig. 4(a). For MWN-g, which transforms the raw mask $M^{raw}$ to the new masks $M_c$, $c = 1, \dots, C$, we use a single convolutional layer. Like MWN-l, the convolution kernel size is always set to $N \times N$, and the raw mask is padded with zeros in the convolution. Fig. 4(b) presents an example of the zero-padded mask for a considered ROI in MWN-g, where the mask and the convolution kernel have the same size. With zero-padding, each mask entry shows a different spatial pattern when its neighbors are considered, as shown in Fig. 4(c). These patterns reflect the relative position of each entry not only in the full image, but also with respect to the ROI.

The goal of M-FRCN-g is to exploit the global context, i.e., to combine all the visual patterns in the full image for the recognition of an ROI. Naturally, the visual patterns inside and outside an ROI may contribute differently to the recognition of that ROI. We expect that the new masks learned by the proposed MWN-g can reflect the contributions of these visual patterns. In the inference stage, the learned mask is independent of the input image and depends only on the relative position of the considered ROI in the image.
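The raw mask used for global context can be illustrated with a short sketch. The code assumes nearest-neighbour down-sampling, implemented by sampling the context map at the centre of each output cell; the text does not specify the down-sampling method. The function name, the box convention `(x0, y0, x1, y1)`, and the two entry values `inside_value` and `outside_value` are hypothetical choices made for this example.

```python
def global_raw_mask(img_h, img_w, roi, n, inside_value=1.0, outside_value=0.5):
    """Build an n x n raw mask encoding where an ROI lies in the image.

    roi is (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    Each output cell takes the context-map value at its centre pixel,
    which amounts to nearest-neighbour down-sampling (an assumption).
    """
    x0, y0, x1, y1 = roi
    mask = []
    for i in range(n):
        cy = (i + 0.5) * img_h / n  # centre of output row i, in image pixels
        row = []
        for j in range(n):
            cx = (j + 0.5) * img_w / n  # centre of output column j
            inside = x0 <= cx < x1 and y0 <= cy < y1
            row.append(inside_value if inside else outside_value)
        mask.append(row)
    return mask


# An ROI covering the top-right part of a 100 x 100 image, down-sampled to 5 x 5.
example = global_raw_mask(100, 100, (40, 0, 100, 60), 5)
```

In `example`, the first three rows are `[0.5, 0.5, 1.0, 1.0, 1.0]` and the last two rows are all `0.5`, so the mask depends only on the ROI position and not on the image content.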

3.4 M-FRCN-lg: Combining M-FRCN-l and M-FRCN-g

M-FRCN-l and M-FRCN-g extract ROI features at different scales. M-FRCN-l focuses on the ROI’s local appearance, while M-FRCN-g exploits global context. We can combine them to further boost the object-detection accuracy. In the combined model, the two feature vectors obtained after GMP from M-FRCN-l and M-FRCN-g, each with one entry per feature channel, are simply concatenated and fed to the following layers. In addition, the backbone convolutional layers and the RPN are shared by M-FRCN-l and M-FRCN-g. We refer to this combined network as M-FRCN-lg.

4 Experiments

4.1 Parameters

The parameters that need to be tuned in our approach are:

- $F$: the number of hidden nodes of the FC layer.

- $v_l$, $v_{in}$, and $v_{out}$: entry values of the raw mask ($v_l$ for the unary local mask; $v_{in}$ and $v_{out}$ for entries inside and outside the ROI in the global mask).

- $N$: initial ROI pooling scale for M-FRCN and kernel size for MWN convolution.

In our experiments, we identify the value of $N$ at the end of a model’s name, e.g., M-FRCN-l-7 indicates M-FRCN-l with $N = 7$. For M-FRCN-lg, we always combine M-FRCN-l and M-FRCN-g with the same value of $N$, e.g., M-FRCN-lg-7 is the combination of M-FRCN-l-7 and M-FRCN-g-7. All proposed and comparison methods are implemented using Caffe [17], and we use a single Titan-X GPU for both training and inference. All experimental results of the existing works are reproduced in the same training and inference configurations, and thus they may not be exactly the same as those reported in the original papers.

4.2 Experiments on PASCAL VOC

Implementation details. Following many existing works on object detection, we evaluate our approach on the PASCAL VOC detection benchmarks [6], which contain objects from 20 categories. We train models on the union set of VOC 2007 trainval and VOC 2012 trainval (VOC 07+12 trainval), and test them on the VOC 2007 test set (VOC 07 test). All models are fine-tuned from pre-trained ImageNet classification models. The input image is scaled such that its shorter side is

pixels. We use a weight decay of , a momentum of , and a mini-batch size of . The learning rate is initialized as and is decreased by a factor of after k iterations. A total of k iterations are performed for training. Note that our MWNs are jointly trained with other network components in an end-to-end manner. In the inference stage, we still use a single-scale ( pixels) scheme. RPN provides region candidates for the following classification and bounding-box regression. A standard non-maximum suppression (NMS) is performed on the detections with an overlap threshold of . For M-FRCN, we set , , and . Besides, we set the value of for M-FRCN-l/g/lg according to the adopted CNN backbone architecture. We report Average Precision using IoU threshold at (AP@) to evaluate the accuracy.
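For readers unfamiliar with the post-processing step, standard greedy non-maximum suppression can be sketched as follows. This is a generic textbook version, not the code used in the experiments; the function names and the `(x0, y0, x1, y1)` box convention are assumptions, and `overlap_threshold` is left as a parameter because the value used in the experiments is not restated here.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, overlap_threshold):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IoU with every already-kept box is at most the threshold.
    Returns the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for k in order:
        if all(iou(boxes[k], boxes[j]) <= overlap_threshold for j in keep):
            keep.append(k)
    return keep
```

For example, with boxes `(0, 0, 10, 10)`, `(1, 1, 11, 11)`, and `(20, 20, 30, 30)` scored 0.9, 0.8, and 0.7, the first two overlap with IoU 81/119 (about 0.68), so a threshold of 0.5 keeps indices `[0, 2]` while a threshold of 0.7 keeps all three.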

Baselines. In order to clearly show the effectiveness of our MWNs and M-FRCNs, we introduce two baseline networks. One has the same architecture as M-FRCN-l, except that the original MWN-l is bypassed and the raw unary mask is directly applied to the ROI feature map. This baseline network, referred to as baseline-local, is actually a standard Faster R-CNN with ROI pooling. The other baseline has the same architecture as M-FRCN-lg, except that the full-image feature map is globally pooled directly without using the masking operation based on MWN-g. This baseline network, referred to as baseline-global, aims to validate the effectiveness of our MWN-g designed for capturing global context for each ROI.

Method            AP@0.5 (%) for N = 7 / 15 / 21
baseline-local    65.0
baseline-global   74.0 / 74.7 / 74.4
M-FRCN-l          73.0 / 73.7 / 73.5
M-FRCN-g          55.6 / 66.9 / 68.5
M-FRCN-lg         75.4 / 75.9 / 75.8
Table 1: Accuracy of baselines and M-FRCN-l/g/lg on VOC 07 test, using different values of N (the initial pooling scale). VGG-16 is used as CNN backbone.

Comparisons with baselines using different $N$ values. First, we compare M-FRCN-l/g/lg and the baselines by varying the value of $N$, the initial pooling scale. In this experiment, VGG-16 [29] is used as the CNN backbone. We use one value of $F$ (the number of FC output nodes) for M-FRCN-l and M-FRCN-g, and another for M-FRCN-lg. From Tab. 1, we can see that M-FRCN-l-7/15/21 have much higher AP than baseline-local, thanks to the proposed channel-wise masking. In addition, M-FRCN-l-7/15/21 have similar APs, suggesting that a finer initial ROI pooling does not help. Although M-FRCN-g has much lower AP than M-FRCN-l, it is still interesting to see that M-FRCN-g alone is able to detect objects effectively using the masked full-image feature map. We conjecture that the lower AP is caused by the full-image initial pooling, which preserves only very coarse spatial information. Moreover, M-FRCN-g-15 achieves higher AP than M-FRCN-g-7, and M-FRCN-g-21 performs better than M-FRCN-g-15. This indicates that a fine full-image feature map is important for M-FRCN-g. Finally, M-FRCN-lg is evaluated. We find that M-FRCN-lg always outperforms M-FRCN-l and baseline-global, demonstrating that global context is effectively utilized to improve object-detection accuracy. In particular, the comparison between M-FRCN-lg and baseline-global validates that our ROI-specific context feature leads to better accuracy than using a uniform global feature for all ROIs in an image.

Comparisons with Faster R-CNN and R-FCN using relatively shallow networks. We compare our M-FRCN-lg with the standard Faster R-CNN [27] using the VGG-16 [29] and VGG_CNN_M_1024 [2] models, and with R-FCN using the VGG-16 model. We use one value of $F$ (the number of FC output nodes) for M-FRCN-l and another for M-FRCN-lg, for both VGG-16 and VGG_CNN_M_1024. The results are shown in Tab. 2. We can see that M-FRCN-l-15 has lower AP than the standard Faster R-CNN, but the model complexity is significantly reduced without using any network-compression technique or specifically designed CNN backbone. Moreover, M-FRCN-lg-15 outperforms M-FRCN-l-15 by considering global context and achieves accuracy close to that of the standard Faster R-CNN, with only a small increase in the number of model parameters. Finally, the fully-convolutional R-FCN has much lower AP than Faster R-CNN and M-FRCNs.

CNN backbone         Method              AP@0.5 (%)   # params
VGG_CNN_M_1024 [2]   Faster R-CNN [27]   64.8         87.5M
                     M-FRCN-l-15         62.1         8.0M
                     M-FRCN-lg-15        63.6         8.5M
VGG-16 [29]          Faster R-CNN [27]   76.5         137.1M
                     R-FCN [4]           62.3         17.6M
                     M-FRCN-l-15         73.7         17.4M
                     M-FRCN-lg-15        75.9         17.9M
Table 2: Accuracy, model complexity, and test time on VOC 07 test set, using VGG_CNN_M_1024 and VGG-16 as CNN backbone.

CNN backbone         Feature encoder                 AP@0.5   AP     AP small   AP medium   AP large   # params   Run-time (sec/img)
ResNet-101 [12]      PS ROI pooling [4]              48.9     28.8   10.7       31.2        41.8       53.8M      0.15
                     deformable PS ROI pooling [5]   50.0     30.6   11.9       34.1        43.7       62.1M      0.23
                     our M-FRCN-l-7                  51.0     30.7   11.6       33.4        45.0       49.5M      0.15
                     our M-FRCN-lg-7                 51.9     31.2   12.0       34.3        46.3       51.7M      0.17
SE-ResNet-101 [14]   PS ROI pooling [4]              51.3     30.7   11.6       33.7        44.8       58.6M      0.16
                     deformable PS ROI pooling [5]   52.8     32.6   12.8       36.3        46.7       66.8M      0.24
                     our M-FRCN-l-7                  53.6     33.0   12.5       36.4        48.3       54.3M      0.16
                     our M-FRCN-lg-7                 54.3     33.4   12.8       36.6        49.0       56.5M      0.18
Table 3: Accuracy, model complexity, and test time of R-FCNs and M-FRCNs on COCO test-dev set, using ResNet-101 or SE-ResNet-101 as CNN backbone.

4.3 Experiments on MS COCO

Implementation details. We also evaluate our M-FRCNs on the MS COCO dataset [21], that contains images of 80 categories of objects. We train models using the union of training and validation images (COCO trainval), and test on the test-dev set (COCO test-dev) [20]. In this experiment, we use the similar training and inference configurations as the experiment on Pascal VOC, except that the learning rate is initialized as and is decreased by a factor of after M iterations. Totally, M iterations are performed for training. Besides, the online hard example mining (OHEM) [28] is also adopted for training. We evaluate object detection by COCO-style AP @ and PASCAL-style AP@ . Model complexity and run-time efficiency are also reported.

Comparisons with R-FCNs using very deep networks. We conducted experiments using very deep networks to build our M-FRCNs and compared their performance against two state-of-the-art feature encoders for object-detection networks. One is the Position-Sensitive (PS) ROI pooling proposed in R-FCN [4], and the other is the deformable PS ROI pooling proposed in DCN [5]. Deformable PS ROI pooling allows grids to shift from their original positions and thus becomes much more flexible, but the bilinear interpolation operations performed on score maps lead to slower run-time, especially when the number of categories is large. As for the CNN backbone, we choose ResNet-101 [12] and SE-ResNet-101 [14]. SE-ResNet-101 equips ResNet-101 with Squeeze-and-Excitation components, which are the key technique of the winning model of the ILSVRC 2017 Image Classification Challenge. Moreover, we use one value of $F$ (the number of FC output nodes) for M-FRCN-l and M-FRCN-g, and another for M-FRCN-lg. The initial pooling scale $N$ is set to $7$ to fit model training into the GPU memory. Note that DCN is officially implemented in a deep-learning framework (MXNet) different from ours (Caffe), so we re-implemented the deformable PS ROI pooling in Caffe. We did not re-implement the deformable CNN backbone, since this work focuses only on the feature encoder. We expect our current experiments to show the ability of the proposed feature encoder to improve object-detection performance for different CNN backbones.

The results are shown in Tab. 3. Compared with the standard R-FCN using PS ROI pooling, the M-FRCN-l models have higher accuracy, lower model complexity, and similar run-time. Moreover, the M-FRCN-l models have slightly higher overall AP and run about one-third faster (0.15 to 0.16 vs. 0.23 to 0.24 sec/img) than the enhanced R-FCN using deformable PS ROI pooling, with fewer parameters. M-FRCN-lg-7 further improves the accuracy of M-FRCN-l-7 by considering global context, while the number of parameters increases by 2.2M and the run-time increases by 0.02 sec/img. Compared with the enhanced R-FCN, M-FRCN-lg-7 has higher object-detection accuracy, less run-time, and lower model complexity.
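The relative differences discussed above can be recomputed directly from the entries of Tab. 3. The short script below is only a convenience check; the dictionary layout and function names are choices made here, and the numbers are copied from the table (parameters in millions, run-time in sec/img).

```python
# (params in millions, run-time in sec/img), copied from Tab. 3.
TABLE3 = {
    "ResNet-101": {
        "PS ROI pooling": (53.8, 0.15),
        "deformable PS ROI pooling": (62.1, 0.23),
        "M-FRCN-l-7": (49.5, 0.15),
        "M-FRCN-lg-7": (51.7, 0.17),
    },
    "SE-ResNet-101": {
        "PS ROI pooling": (58.6, 0.16),
        "deformable PS ROI pooling": (66.8, 0.24),
        "M-FRCN-l-7": (54.3, 0.16),
        "M-FRCN-lg-7": (56.5, 0.18),
    },
}


def runtime_reduction_percent(backbone, slow, fast):
    """Percentage by which `fast` lowers the per-image run-time of `slow`."""
    t_slow = TABLE3[backbone][slow][1]
    t_fast = TABLE3[backbone][fast][1]
    return 100.0 * (t_slow - t_fast) / t_slow


def cost_of_global_context(backbone):
    """Extra parameters (M) and run-time (sec/img) of M-FRCN-lg-7 over M-FRCN-l-7."""
    p_l, t_l = TABLE3[backbone]["M-FRCN-l-7"]
    p_lg, t_lg = TABLE3[backbone]["M-FRCN-lg-7"]
    return round(p_lg - p_l, 1), round(t_lg - t_l, 2)
```

With these values, M-FRCN-l-7 reduces run-time relative to deformable PS ROI pooling by roughly 35% on ResNet-101 and 33% on SE-ResNet-101, and adding the global branch costs 2.2M parameters and 0.02 sec/img on both backbones.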

Figure 5: An example of how M-FRCN-l encodes spatial information for a feature channel which is strongly activated for human head. (a) Original image. (b) Mask learned by MWN-l for this channel. (c) Visualization of the channel activation and two example ROIs (indicated by green boxes). (d) Resulting ROI feature maps after masking.
Figure 6: 128 example masks learned by MWN-l.

4.4 Qualitative Analysis

In this section, we use real examples to show how the masks learned by MWN-l and MWN-g help object detection. In Fig. 5, we present an example of how M-FRCN-l encodes spatial information for a feature channel that is strongly activated for the human head. The mask learned by MWN-l for this channel has high values at the top-center positions, as shown in Fig. 5(b), which is reasonable since the head tends to appear at the top-center position of a human body in the COCO dataset. By multiplying the ‘head’ channel of an ROI feature map (Fig. 5(c)) with its mask, the spatial information of the human head can be effectively encoded for object detection. As shown in Fig. 5(c) and (d), if an ROI accurately overlaps with a human body, there is strong activation in the masked feature. In contrast, if an ROI does not overlap well with a human body, the activation is very weak. Moreover, Fig. 6 shows 128 example masks learned by MWN-l. We can see that our MWN-l can produce masks with very complicated patterns.

For M-FRCN-l, we also observe feature channels that are activated for a specific position of an object, rather than a specific visual pattern. We think these feature channels can help object localization. Fig. 7(a) and (b) present activation maps of two such channels and their learned masks. The feature channel shown in Fig. 7(a) is strongly activated for the horizontal endpoints of an object. Interestingly, the learned mask for this feature channel has high values in the middle, rather than on the left and right sides. We believe the use of this mask can help detect an object without confusing it with other nearby objects, as shown in the fourth and fifth example images in Fig. 7(a). The feature channel shown in Fig. 7(b) is strongly activated for the lower-right corner of an object, and the learned mask has high values at the right side, which is also reasonable.

For M-FRCN-g, we present three examples in Fig. 8 to show the masks learned by MWN-g for a feature channel that is strongly activated at leaves around a bird. Fig. 8(a) shows three bird images where birds are surrounded by leaves, and Fig. 8(b) presents the activation maps for the ‘leaf’ feature channel and ROI examples. As shown in Fig. 8(c), the learned masks largely have high values around the ROI, indicating that the leaves (visual pattern) around an ROI (spatial information) can contribute to the ROI recognition. We can see that our M-FRCN-g provides a novel way to utilize the visual patterns outside the ROI in the full image for the recognition of a specific ROI.

Figure 7: Rows (a) and (b) show the activation maps of two channels on five images, and the masks learned for these channels. Original images are shown on the top row and example ROIs are indicated by green boxes.
Figure 8: Examples of masks learned in M-FRCN-g for a feature channel which is strongly activated at leaves around a bird. (a) Original images. (b) Visualization of the channel activation and example ROIs (indicated by green boxes). (c) Masks produced by MWN-g for this channel, given the ROIs presented in (b).
Figure 9: An illustration of ROI pooling by including masking operations. (a) ROI pooling ($2 \times 2$) in existing R-CNNs, where grids are represented by different colors. (b) Using a set of masks to implement the ROI pooling shown in (a), by sequentially performing an initial ROI pooling, four masking operations, and global max poolings (GMPs). (c) In this paper, our basic idea is to relax these masks to non-binary ones and learn them automatically for better feature encoding and object detection.

5 Relation between Mask-based and Grid-based Feature Encoders

In this section, we show that the proposed mask-based feature encoder is a generalization of the traditional grid-based one. In the standard Fast/Faster R-CNN architectures, ROI pooling partitions each region proposal into equal-sized grids and then performs a pooling operation within each grid to produce a fixed-dimensional ROI feature map for the following layers. The ROI feature map is transformed to a feature vector in raster-scan order, which reflects the spatial relations between grids. An example of this ROI pooling operation with $2 \times 2$ grids is shown in Fig. 9(a). In fact, this standard ROI pooling can be implemented by an initial pooling at a finer scale, followed by applying a set of binary masks and global max poolings, as shown in Fig. 9(b). In each mask, only entries in specific spatial locations take the value 1.

Our mask-based feature encoder relaxes these masks to more informative non-binary ones and learns them using an MWN. Furthermore, it can be relaxed to learn different sets of masks for different channels. Fig. 9(c) shows an example of the masking operation in the proposed method, where a learned non-binary mask is applied to a channel of the ROI feature map, and a feature element is output.
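The equivalence between grid-based max pooling and binary masking followed by global max pooling can be verified on a toy example. The sketch below assumes a single channel with non-negative activations (as after a ReLU) and a finely pooled map whose side is a multiple of the grid count; the function names and the 4 x 4 example values are illustrative assumptions.

```python
def grid_max_pool(fmap, grid):
    """Max pooling of an n x n map into grid x grid cells, in raster-scan order."""
    cell = len(fmap) // grid
    return [max(fmap[i][j]
                for i in range(gi * cell, (gi + 1) * cell)
                for j in range(gj * cell, (gj + 1) * cell))
            for gi in range(grid) for gj in range(grid)]


def binary_grid_masks(n, grid):
    """One n x n binary mask per grid cell: 1 inside the cell, 0 elsewhere."""
    cell = n // grid
    return [[[1 if (i // cell == gi and j // cell == gj) else 0
              for j in range(n)] for i in range(n)]
            for gi in range(grid) for gj in range(grid)]


def masked_gmp(fmap, mask):
    """Element-wise masking followed by global max pooling."""
    return max(x * m
               for row_x, row_m in zip(fmap, mask)
               for x, m in zip(row_x, row_m))


# A 4 x 4 non-negative channel pooled into 2 x 2 grids, computed both ways.
channel = [[1, 2, 0, 3],
           [4, 0, 1, 1],
           [0, 5, 2, 2],
           [1, 1, 6, 0]]
by_grid = grid_max_pool(channel, 2)
by_mask = [masked_gmp(channel, m) for m in binary_grid_masks(4, 2)]
# Both give [4, 3, 5, 6].
```

Replacing the 0/1 entries of these masks with learned real values, and learning a separate mask per channel, gives the relaxation described in this section.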

6 Conclusion

In this paper, we proposed a new feature encoder for object detection. Unlike the existing methods, which utilize grids to represent object parts and learn what is likely to appear in each grid, the proposed method learns masks to reflect the spatial distributions of a set of visual patterns. The proposed feature encoder can be used to capture both local ROI appearance and global context. By integrating our feature encoder into the Faster R-CNN architecture, we obtain the MWN-based Faster R-CNN (M-FRCN). As shown by the experimental results, M-FRCNs have object-detection accuracy comparable to that of the standard Faster R-CNNs, but with significantly reduced model complexity. When compared with R-FCNs using very deep CNN backbones, M-FRCNs can produce higher object-detection accuracy with good run-time efficiency.

References