CenterMask: Real-Time Anchor-Free Instance Segmentation

11/15/2019 ∙ by Youngwan Lee, et al. ∙ ETRI

We propose a simple yet efficient anchor-free instance segmentation method, called CenterMask, that adds a novel spatial attention-guided mask (SAG-Mask) branch to an anchor-free one-stage object detector (FCOS) in the same vein as Mask R-CNN. Plugged into the FCOS object detector, the SAG-Mask branch predicts a segmentation mask on each box with a spatial attention map that helps to focus on informative pixels and suppress noise. We also present an improved VoVNetV2 with two effective strategies: (1) a residual connection that alleviates the saturation problem of larger VoVNet and (2) effective Squeeze-Excitation (eSE), which deals with the information loss problem of the original SE. With SAG-Mask and VoVNetV2, we design CenterMask and CenterMask-Lite, which are targeted at large and small models, respectively. CenterMask outperforms all previous state-of-the-art models at a much faster speed. CenterMask-Lite also achieves 33.4% mask AP / 38.0% box AP, outperforming the state-of-the-art by 2.6 / 7.0 AP, respectively, at over 35 fps on Titan Xp. We hope that CenterMask and VoVNetV2 can serve as a solid baseline for real-time instance segmentation and as a backbone network for various vision tasks, respectively. Code will be released.




1 Introduction

Recently, instance segmentation has made great progress beyond object detection. The most representative method, Mask R-CNN [9], which extends an object detector (e.g., Faster R-CNN [28]), has dominated COCO [21] benchmarks, since instance segmentation can be easily solved by detecting objects and then predicting pixels on each box. However, even though there have been many works [14, 2, 3, 18, 22] on improving Mask R-CNN, few works consider the speed of instance segmentation. Although YOLACT [1] is the first real-time one-stage instance segmentation method, owing to its parallel structure and extremely lightweight assembly process, its accuracy gap from Mask R-CNN is still significant. Thus, we aim to bridge the gap by improving both accuracy and speed.

Figure 1: Accuracy-speed tradeoff across various instance segmentation models (top) and backbone networks (bottom) on COCO. The inference speed of CenterMask & CenterMask-Lite is reported on the same GPU (V100/Xp) as their counterparts; larger models: Mask R-CNN [9]/TensorMask [5]/RetinaMask [7]/ShapeMask [15], and small model: YOLACT [1]. Note that all backbone networks in the bottom plot are compared under the proposed CenterMask. Please refer to Section 3.2, Table 3 and Table 5 for details.
Figure 2: Architecture of CenterMask, where P3 (stride of $2^3$) to P7 (stride of $2^7$) denote the feature maps in the feature pyramid of the backbone network. Using the features from the backbone, FCOS predicts bounding boxes. The Spatial Attention-Guided Mask (SAG-Mask) predicts a segmentation mask inside each detected box, with the Spatial Attention Module (SAM) helping to focus on informative pixels and suppress noise.

While Mask R-CNN is based on a two-stage object detector (e.g., Faster R-CNN) that first generates box proposals and then predicts box location and classification, YOLACT is built on a one-stage detector (RetinaNet [20]) that directly predicts boxes without a proposal step. However, these object detectors rely heavily on pre-defined anchors, which are sensitive to hyper-parameters (e.g., input size, aspect ratio, scales, etc.) and to different datasets. Besides, since they densely place anchor boxes for a higher recall rate, the excessive number of anchor boxes causes an imbalance of positive/negative samples and higher computation/memory cost. To cope with these drawbacks of anchor boxes, many recent works [16, 6, 34, 35, 31, 34] move away from anchor boxes toward anchor-free designs that use corner/center points, which leads to more computation-efficient detectors with better performance than anchor-box-based ones.

Therefore, we design a simple yet efficient anchor-free one-stage instance segmentation method, called CenterMask, that adds a novel spatial attention-guided mask branch to the more efficient one-stage anchor-free object detector (FCOS [31]) in the same way as Mask R-CNN. Plugged into the FCOS object detector, our spatial attention-guided mask (SAG-Mask) branch takes the predicted boxes from the FCOS detector to predict segmentation masks on each Region of Interest (RoI). The spatial attention module (SAM) in the SAG-Mask helps the mask branch to focus on meaningful pixels and suppress uninformative ones.

When extracting features on each RoI for mask prediction, each RoIAlign [9] should be assigned to a feature level according to the RoI scale. Mask R-CNN uses an assignment rule proposed in [19] that does not consider the input scale. Thus, we design a scale-adaptive RoI assignment function that considers the input scale and is more suitable for one-stage object detectors.

We also propose a more effective backbone network, VoVNetV2, based on VoVNet [17], which shows better performance and faster speed than ResNet [10] and DenseNet [13] due to its One-Shot Aggregation (OSA). We found that stacking OSA modules in VoVNet makes the performance saturate. We attribute this phenomenon to the same cause that motivated ResNet [10]: the backpropagation of gradients is disturbed. Thus, we add a residual connection [10] to each OSA module to ease the optimization, which allows VoVNet to go deeper and, in turn, boosts the performance.

We also found that the two fully-connected (FC) layers in the Squeeze-Excitation (SE) [12] channel attention module reduce the channel dimension to mitigate the computational burden, which instead causes channel information loss. Thus, we re-design the SE module as effective SE (eSE), replacing the two FC layers with one FC layer that maintains the channel dimension, which prevents the information loss and, in turn, improves the performance. With the residual connection and eSE modules, we propose VoVNetV2 at various scales: lightweight VoVNetV2-19, base VoVNetV2-39/57, and the large VoVNetV2-99, which correspond to MobileNet-V2, ResNet-50/101 & HRNet-W18/32, and ResNeXt-32x8d, respectively.

With SAG-Mask and VoVNetV2, we design CenterMask and CenterMask-Lite, which are targeted at large and small models, respectively. Extensive experiments demonstrate the effectiveness of CenterMask & CenterMask-Lite and VoVNetV2. Using the same ResNet-101 backbone, CenterMask outperforms all previous state-of-the-art single models on the COCO [21] instance segmentation and detection tasks while running at a much faster speed. CenterMask-Lite with a VoVNetV2-39 backbone also achieves 33.4% mask AP / 38.0% box AP, outperforming the state-of-the-art real-time instance segmentation method YOLACT [1] by 2.6 / 7.0 AP, respectively, at over 35 fps on Titan Xp.

2 CenterMask

In this section, first, we review the anchor-free object detector, FCOS, which is a fundamental object detection part of our CenterMask. Next, we demonstrate the architecture of the CenterMask and describe how the proposed spatial attention-guided mask branch (SAG-Mask) is designed to plug into the FCOS detector. Finally, a more effective backbone network, VoVNetV2, is proposed to boost the performance of CenterMask in terms of accuracy and speed.

2.1 FCOS

FCOS is an anchor-free and proposal-free object detector that works in a per-pixel prediction manner, like FCN [24]. Most state-of-the-art object detectors, such as Faster R-CNN [28], YOLO [27], and RetinaNet [20], use the concept of pre-defined anchor boxes, which need elaborate parameter tuning and complex calculations associated with box IoU in training. Without anchor boxes, FCOS directly predicts a 4D vector plus a class label at each spatial location on a level of feature maps. As shown in Figure 2, the 4D vector embeds the relative offsets from the location to the four sides of a bounding box (i.e., left, right, top and bottom). In addition, FCOS introduces a centerness branch to predict the deviation of a pixel from the center of its corresponding bounding box, which improves the detection performance. By avoiding the complex computation of anchor boxes, FCOS reduces memory/computation cost and also outperforms anchor-box-based object detectors. Because of the efficiency and good performance of FCOS, we build the proposed CenterMask upon the FCOS object detector.
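As a rough illustration of the per-pixel prediction described above, the sketch below decodes a 4D offset vector into a box and computes a centerness value. The function names are hypothetical, and the centerness formula is the one given in the FCOS paper [31] rather than something defined in this section; all offsets are assumed to be strictly positive.

```python
import math

def decode_box(x, y, left, top, right, bottom):
    """Turn the distances from location (x, y) to the four box sides
    into corner coordinates (x1, y1, x2, y2)."""
    return (x - left, y - top, x + right, y + bottom)

def centerness(left, top, right, bottom):
    """Centerness as defined in the FCOS paper: 1.0 at the box center,
    approaching 0 as the location moves toward a box border.
    Assumes all four distances are strictly positive."""
    return math.sqrt((min(left, right) / max(left, right)) *
                     (min(top, bottom) / max(top, bottom)))

# A location at (100, 80) with predicted distances to the box sides.
print(decode_box(100, 80, left=30, top=20, right=50, bottom=40))  # (70, 60, 150, 120)
print(centerness(40, 30, 40, 30))  # 1.0, the location is the exact box center
print(round(centerness(30, 20, 50, 40), 3))  # off-center location, lower value
```

Locations far from the center of their box receive a low centerness, which can then be used to down-weight their detections.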

2.2 Architecture

Figure 2 shows the overall architecture of CenterMask. CenterMask consists of three parts: (1) a backbone for feature extraction, (2) an FCOS detection head, and (3) a mask head. Masking objects consists of detecting objects with the FCOS box head and then predicting segmentation masks inside the cropped regions in a per-pixel manner.

2.3 Adaptive RoI Assignment Function

After object proposals are predicted in the FCOS box head, CenterMask predicts segmentation masks using the predicted box regions in the same vein as Mask R-CNN. As the RoIs are predicted from different levels of feature maps in the Feature Pyramid Network (FPN [19]), RoI Align [9], which extracts the features, should be assigned to different scales of feature maps with respect to the RoI scales. Specifically, an RoI with a large scale has to be assigned to a higher feature level and vice versa. Mask R-CNN [9], which is based on a two-stage detector, uses Equation 1 from FPN [19] to determine which feature map ($P_k$) an RoI is assigned to:

$$k = \left\lfloor k_0 + \log_2\left(\sqrt{wh}/224\right) \right\rfloor \tag{1}$$

where $k_0$ is 4 and $w$, $h$ are the width and height of each RoI. However, Equation 1 is not suitable for a one-stage detector such as CenterMask, for two reasons. First, Equation 1 is tuned to two-stage detectors (e.g., FPN [19]), which use different feature levels from one-stage detectors (e.g., FCOS [31], RetinaNet [20]). Specifically, two-stage detectors use feature levels P2 (stride of $2^2$) to P5 ($2^5$), while one-stage detectors use P3 ($2^3$) to P7 ($2^7$), which have larger receptive fields and lower resolution. Besides, the canonical ImageNet pretraining size 224 in Equation 1 is hard-coded and not adaptive to feature scale variation. For example, when the input dimension is 1024×1024 and the area of an RoI is $224^2$, the RoI is assigned to the relatively high feature level P4 despite its small area with respect to the input dimension, which reduces small-object AP. Therefore, we define Equation 2 as a new RoI assignment function suited for one-stage detectors such as CenterMask.


$$k = \left\lceil k_{max} - \log_2\left(A_{input}/A_{RoI}\right) \right\rceil \tag{2}$$

where $k_{max}$ is the last level (e.g., 7) of the feature maps in the backbone and $A_{input}$, $A_{RoI}$ are the areas of the input image and the RoI, respectively. Without the canonical size 224 of Equation 1, Equation 2 adaptively assigns the RoI pooling scale by the ratio of input/RoI area. If $k$ is lower than the minimum level (e.g., P3), $k$ is clamped to the minimum level. Specifically, if the area of an RoI is bigger than half of the input area, the RoI is assigned to the highest feature level (e.g., P7). Conversely, while Equation 1 assigns P4 to the RoI with area $224^2$, Equation 2 determines the $k_{max} - 5$ level, which may be the minimum feature level, for an RoI whose area is about 20 times smaller than the input size. Table 4 shows that the proposed RoI assignment method improves small-object AP over Equation 1 because of its adaptive and scale-aware assignment strategy. Based on an ablation study, we set $k_{max}$ to P5 and $k_{min}$ to P3.
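The two assignment rules can be compared numerically with a small sketch. It assumes that Equation 1 is the FPN rule $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$ and that Equation 2 has the form $k = \lceil k_{max} - \log_2(A_{input}/A_{RoI}) \rceil$ followed by clamping to $[k_{min}, k_{max}]$. The function names and the clamping range used for Equation 1 (P2 to P5) are illustrative choices, not taken from the text.

```python
import math

def fpn_level_eq1(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Equation 1 (FPN rule): the level depends on the RoI size relative
    to a fixed canonical size of 224, regardless of the input size."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))

def adaptive_level_eq2(roi_area, input_area, k_min=3, k_max=5):
    """Equation 2 (assumed form): the level depends on the ratio between
    the input area and the RoI area, then is clamped to [k_min, k_max]."""
    k = math.ceil(k_max - math.log2(input_area / roi_area))
    return max(k_min, min(k_max, k))

input_area = 1024 * 1024
small_roi = 224 * 224

print(fpn_level_eq1(224, 224))                        # 4: Equation 1 picks P4
print(adaptive_level_eq2(small_roi, input_area))      # 3: Equation 2 picks P3
print(adaptive_level_eq2(0.6 * input_area, input_area))  # 5: large RoI goes to k_max
```

With the settings from the ablation study (`k_min=3`, `k_max=5`), the small RoI from the example in the text lands on the highest-resolution mask level instead of P4.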

Figure 3: Comparison of OSA modules. $F_{1\times1}$, $F_{3\times3}$ denote 1×1 and 3×3 conv layers respectively, $F_{avg}$ is global average pooling, $W_C$ is a fully-connected layer, $A_{eSE}$ is the channel attention map, $\otimes$ indicates element-wise multiplication and $\oplus$ denotes element-wise addition.

2.4 Spatial Attention-Guided Mask

Recently, attention methods [12, 32, 36, 26] have been widely applied to object detection because they help to focus on important features and suppress unnecessary ones. In particular, channel attention [12, 11] emphasizes 'what' to focus on across the channels of feature maps, while spatial attention [32, 4] focuses on 'where' the informative regions are. Inspired by the spatial attention mechanism, we adopt a spatial attention module to guide the mask head toward spotlighting meaningful pixels and repressing uninformative ones.

Thus, we design a spatial attention-guided mask (SAG-Mask), as shown in Figure 2. Once features inside the predicted RoIs are extracted by RoI Align [9] at 14×14 resolution, those features are fed into four conv layers and the spatial attention module (SAM) sequentially. To exploit the spatial attention map $A_{sag}(X_i) \in \mathbb{R}^{1\times W\times H}$ as a feature descriptor given the input feature map $X_i \in \mathbb{R}^{C\times W\times H}$, the SAM first generates pooled features $P_{avg}, P_{max} \in \mathbb{R}^{1\times W\times H}$ by both average and max pooling operations, respectively, along the channel axis and aggregates them via concatenation. This is followed by a 3×3 conv layer and normalization by the sigmoid function. The computation process is summarized as follows:

$$A_{sag}(X_i) = \sigma\left(F_{3\times3}(P_{max} \circ P_{avg})\right)$$

where $\sigma$ denotes the sigmoid function, $F_{3\times3}$ is a 3×3 conv layer and $\circ$ represents the concatenation operation. Finally, the attention-guided feature map $X_{sag} \in \mathbb{R}^{C\times W\times H}$ is computed as:

$$X_{sag} = A_{sag}(X_i) \otimes X_i$$

where $\otimes$ denotes element-wise multiplication. After that, a 2×2 deconv upsamples the spatially attended feature map to 28×28 resolution. Lastly, a 1×1 conv is applied to predict class-specific masks.
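The SAM computation described above (channel-wise max and average pooling, concatenation, a conv layer, a sigmoid, then element-wise multiplication) can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: it assumes a 3×3 kernel with zero padding, represents feature maps as nested lists of shape C × H × W, and uses the hypothetical name `spatial_attention`. The kernel weights are placeholders, not learned values.

```python
import math

def spatial_attention(x, kernel, bias=0.0):
    """x: C x H x W nested lists. kernel: 2 x 3 x 3 weights, where
    kernel[0] acts on the max-pooled map and kernel[1] on the
    average-pooled map. Returns (attention map, attended features)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Pool along the channel axis, producing two 1 x H x W maps.
    p_max = [[max(x[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    p_avg = [[sum(x[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    pooled = [p_max, p_avg]  # concatenation along the channel axis
    attn = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = bias
            for ch in range(2):
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < H and 0 <= jj < W:  # zero padding
                            acc += kernel[ch][di + 1][dj + 1] * pooled[ch][ii][jj]
            attn[i][j] = 1.0 / (1.0 + math.exp(-acc))  # sigmoid
    # Broadcast the single-channel attention map over all C channels.
    x_sag = [[[attn[i][j] * x[c][i][j] for j in range(W)] for i in range(H)]
             for c in range(C)]
    return attn, x_sag

x = [[[1.0, 2.0], [3.0, 4.0]],
     [[4.0, 3.0], [2.0, 1.0]]]
zero_kernel = [[[0.0] * 3 for _ in range(3)] for _ in range(2)]
attn, x_sag = spatial_attention(x, zero_kernel)
print(attn)      # [[0.5, 0.5], [0.5, 0.5]]
print(x_sag[0])  # [[0.5, 1.0], [1.5, 2.0]]
```

With an all-zero kernel the attention is a uniform 0.5, so every feature is simply halved; learned weights would instead raise the attention at informative locations and lower it elsewhere.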

2.5 VoVNetV2 backbone

In this section, we propose a more effective backbone network, VoVNetV2, to further boost the performance of CenterMask. VoVNetV2 improves on VoVNet [17] by adding a residual connection [10] and the proposed effective Squeeze-and-Excitation (eSE) attention module. VoVNet is a computation- and energy-efficient backbone network that can efficiently produce diversified feature representations because of its One-Shot Aggregation (OSA) modules. As shown in Figure 3(a), an OSA module consists of consecutive conv layers and aggregates the subsequent feature maps at once, which captures diverse receptive fields efficiently and, in turn, outperforms DenseNet and ResNet in terms of accuracy and speed.

Residual connection: Even with its efficient and diverse feature representation, VoVNet has a limitation with respect to optimization. As more OSA modules are stacked (i.e., the network gets deeper), we observe that the accuracy of the deeper models saturates. Based on the motivation of ResNet [10], we conjecture that stacking OSA modules makes the backpropagation of gradients gradually harder due to the increasing number of transformation functions such as conv. Therefore, as shown in Figure 3(b), we add the identity mapping [10] to OSA modules. Specifically, the input path is connected to the end of an OSA module, so that the gradients of every OSA module can be backpropagated in an end-to-end manner on each stage, as in ResNet. Besides boosting the performance of VoVNet, the identity mapping also makes it possible to enlarge the depth of VoVNet, as in VoVNet-99.
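The data flow of an OSA module with the identity mapping can be sketched as follows. This is a structural toy, not the actual network: features are plain lists for a single spatial location, each "conv" is an arbitrary callable, and `aggregate` stands in for the conv that fuses the concatenated features. Including the module input in the concatenation is an assumption borrowed from public VoVNet code, and `osa_forward` is a hypothetical name.

```python
def osa_forward(x, conv_layers, aggregate, use_identity=True):
    """One-Shot Aggregation over a per-location feature vector.
    conv_layers: callables applied one after another.
    aggregate: callable mapping the concatenated features back to
    a vector with the same length as x."""
    outputs = [x]  # assumption: the module input is aggregated as well
    h = x
    for conv in conv_layers:
        h = conv(h)
        outputs.append(h)
    # Concatenate every intermediate feature only once, at the end.
    concat = [v for feat in outputs for v in feat]
    out = aggregate(concat)
    if use_identity:
        # VoVNetV2: add the input back (requires matching channel counts).
        out = [o + xi for o, xi in zip(out, x)]
    return out

double = lambda v: [2 * a for a in v]
# Toy aggregation for 2 channels: sum the even and odd positions separately.
fuse = lambda c: [sum(c[0::2]), sum(c[1::2])]

x = [1.0, 2.0]
print(osa_forward(x, [double, double], fuse, use_identity=False))  # [7.0, 14.0]
print(osa_forward(x, [double, double], fuse, use_identity=True))   # [8.0, 16.0]
```

The identity path gives gradients a direct route from the module output to its input, which is the property the paragraph above relies on when stacking many OSA modules.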

Effective Squeeze-Excitation (eSE): To further boost the performance of VoVNet, we also design a channel attention module, effective Squeeze-Excitation (eSE), which improves on the original SE [12]. As the representative channel attention method adopted in CNN architectures, Squeeze-Excitation (SE) [12] explicitly models the interdependency between the channels of feature maps to enhance their representation. The SE module squeezes the spatial dependency by global average pooling to learn a channel-specific descriptor, and then two fully-connected (FC) layers followed by a sigmoid function are used to rescale the input feature map so as to highlight only useful channels. In short, given the input feature map $X_i \in \mathbb{R}^{C\times W\times H}$, the channel attention map $A_{ch}(X_i) \in \mathbb{R}^{C\times1\times1}$ is computed as:

$$A_{ch}(X_i) = \sigma\left(W_C\left(\delta\left(W_{C/r}\left(F_{gap}(X_i)\right)\right)\right)\right)$$

$$F_{gap}(X) = \frac{1}{WH}\sum_{i,j=1}^{W,H} X_{i,j}$$

where $F_{gap}$ is channel-wise global average pooling, $W_{C/r}$, $W_C$ are the weights of the two fully-connected layers, $\delta$ denotes the ReLU non-linear operator and $\sigma$ indicates the sigmoid function.

However, we identify a limitation of the SE module: channel information loss due to dimension reduction. To avoid a high model complexity burden, the two FC layers of the SE module need to reduce the channel dimension. Specifically, while the first FC layer reduces the input feature channels $C$ to $C/r$ using the reduction ratio $r$, the second FC layer expands the reduced channels back to the original channel size $C$. As a result, this channel dimension reduction causes channel information loss.

Therefore, we propose effective SE (eSE), which uses only one FC layer with $C$ channels instead of two FC layers and thus involves no channel dimension reduction; this maintains channel information and, in turn, improves performance. The eSE process is defined as:

$$A_{eSE}(X_{div}) = \sigma\left(W_C\left(F_{gap}(X_{div})\right)\right)$$

$$X_{refine} = A_{eSE}(X_{div}) \otimes X_{div}$$

where $X_{div} \in \mathbb{R}^{C\times W\times H}$ is the diversified feature map computed by the 1×1 conv in the OSA module. As a channel-attentive feature descriptor, $A_{eSE} \in \mathbb{R}^{C\times1\times1}$ is applied to the diversified feature map $X_{div}$ to make the diversified feature more informative. Finally, when using the residual connection, the input feature map is element-wise added to the refined feature map $X_{refine}$. The details of how the eSE module is plugged into the OSA module are shown in Figure 3(c).
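The difference between SE and eSE can be sketched in a few lines of plain Python: SE passes the pooled descriptor through a C → C/r → C bottleneck, while eSE uses a single C → C layer. This is an illustration only. Feature maps are nested lists of shape C × H × W, the FC layers have no bias (an assumption), and all function names and weight values are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def global_avg_pool(x):
    """x: C x H x W nested lists -> list of C channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

def fc(weights, vec):
    """Bias-free fully-connected layer; weights is out_dim x in_dim."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def se_attention(x, w_reduce, w_expand):
    """Original SE: C -> C/r (ReLU) -> C (sigmoid)."""
    hidden = [max(0.0, h) for h in fc(w_reduce, global_avg_pool(x))]
    return [sigmoid(a) for a in fc(w_expand, hidden)]

def ese_attention(x, w_c):
    """eSE: one C -> C layer followed by a sigmoid, no channel reduction."""
    return [sigmoid(a) for a in fc(w_c, global_avg_pool(x))]

def rescale(x, attn):
    """Multiply every channel of x by its attention weight."""
    return [[[a * v for v in row] for row in ch] for ch, a in zip(x, attn)]

x = [[[1.0, 3.0], [1.0, 3.0]],   # channel 0, mean 2.0
     [[0.0, 0.0], [0.0, 0.0]]]   # channel 1, mean 0.0

# SE with r = 2: both channels are squeezed through one hidden unit.
print(se_attention(x, w_reduce=[[1.0, 1.0]], w_expand=[[1.0], [-1.0]]))
# eSE: each output channel can see every input channel directly.
attn = ese_attention(x, w_c=[[1.0, 0.0], [0.0, 1.0]])
print(attn)
print(rescale(x, attn)[1])  # [[0.0, 0.0], [0.0, 0.0]]
```

In the SE call, the two channel descriptors are merged into a single hidden value before being expanded again, which is the kind of information loss that the single full-width layer of eSE avoids, at the price of more parameters (roughly $C^2$ versus $2C^2/r$ weights).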

2.6 Implementation details

Since CenterMask is built on the FCOS [31] object detector, we follow the hyper-parameters of FCOS, except that we use a positive score threshold of 0.03 instead of 0.05, since FCOS does not generate positive RoI samples well in the initial training phase. While using FPN levels 3 through 7 with 256 channels in the detection step, we use P3 ~ P7 in the masking step, as mentioned in Section 2.3. We also use mask scoring [14], which recalibrates the classification score with the predicted mask IoU score in Mask R-CNN.

CenterMask-Lite: To achieve real-time processing, we make the proposed CenterMask lightweight by downsizing three parts: the backbone, the box head, and the mask head. In the backbone, we first reduce the channels of FPN from 256 to 128, which decreases the output of the conv layers in FPN as well as the input dimension of the box and mask heads. We then replace the backbone network with the more lightweight VoVNetV2-19, which has 4 OSA modules on each stage, each comprising 3 conv layers instead of the 5 used in VoVNetV2-39/57. In the box head, there are four conv layers with 256 channels on each of the classification and box branches, where the centerness branch is shared with the box branch. We reduce the number of conv layers from 4 to 2, with 128 channels. Lastly, in the mask head, we also reduce the number of conv layers and channels in the feature extractor and mask scoring part from (4, 256) to (2, 128), respectively.

Training: We set the number of detection boxes from the FCOS to 100, and the highest-scoring boxes are fed into the SAG-Mask branch for training the mask branch. We use the same mask target as Mask R-CNN, which is made by the intersection between an RoI and its associated ground-truth mask. During training, we define a multi-task loss on each RoI as:

$$L = L_{cls} + L_{center} + L_{box} + L_{mask}$$

where the classification loss $L_{cls}$, centerness loss $L_{center}$, and box regression loss $L_{box}$ are the same as those in [31] and $L_{mask}$ is the average binary cross-entropy loss, identical to that in [9]. Unless specified, the input image is resized to have 800 pixels [19] along the shorter side and at most 1333 pixels along the longer side. We train CenterMask using Stochastic Gradient Descent (SGD) for 90K iterations (~12 epochs) with a mini-batch of 16 images and an initial learning rate of 0.01, which is decreased by a factor of 10 at 60K and 80K iterations. We use a weight decay of 0.0001 and a momentum of 0.9. All backbone models are initialized with ImageNet pre-trained weights.
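The training schedule above can be checked with a short sketch. It assumes a plain step schedule without warm-up (warm-up is not mentioned in the text), an unweighted sum of the four loss terms, and the COCO train2017 size of 118,287 images; the function names are hypothetical.

```python
def learning_rate(iteration, base_lr=0.01, steps=(60_000, 80_000), gamma=0.1):
    """Step schedule: multiply the learning rate by gamma at each milestone."""
    passed = sum(1 for s in steps if iteration >= s)
    return base_lr * gamma ** passed

def total_loss(l_cls, l_center, l_box, l_mask):
    """Multi-task loss as an unweighted sum of the four terms."""
    return l_cls + l_center + l_box + l_mask

def num_epochs(iterations=90_000, batch_size=16, dataset_size=118_287):
    """Number of passes over the dataset implied by the iteration count."""
    return iterations * batch_size / dataset_size

for it in (0, 59_999, 60_000, 80_000, 89_999):
    print(it, learning_rate(it))

print(round(num_epochs(), 1))  # 12.2, matching the roughly 12 epochs stated above
```

Doubling the iteration count and milestones would give a schedule of about 24 epochs.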

Inference: At test time, the FCOS detection part yields 50 high-score detection boxes, and then the mask branch uses them to predict segmentation masks on each RoI. CenterMask/CenterMask-Lite use a single scale of 800/600 pixels for the shorter side, respectively.
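The resizing rule (fixed shorter side with a cap on the longer side) can be written as a small helper. The name `resized_shape` is hypothetical, rounding to the nearest integer is an assumption, and the longer-side cap for CenterMask-Lite is not stated in the text, so the 1333 default is simply kept for it here.

```python
def resized_shape(height, width, short_side=800, max_long_side=1333):
    """Scale the image so its shorter side equals short_side, unless that
    would push the longer side past max_long_side; in that case scale so
    the longer side equals max_long_side instead."""
    scale = short_side / min(height, width)
    if max(height, width) * scale > max_long_side:
        scale = max_long_side / max(height, width)
    return round(height * scale), round(width * scale)

print(resized_shape(480, 640))                  # (800, 1067)
print(resized_shape(400, 1000))                 # (533, 1333), capped by the long side
print(resized_shape(480, 640, short_side=600))  # (600, 800), CenterMask-Lite test scale
```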

| Component | APmask | APbox | Time (ms) |
|---|---|---|---|
| FCOS (baseline), ours | - | 37.8 | 57 |
| + mask head (Eq. 1 [19]) | 33.4 | 38.3 | 67 |
| + mask head (Eq. 2, ours) | 33.6 | 38.3 | 67 |
| + SAM | 33.8 | 38.6 | 67 |
| + Mask scoring | 34.4 | 38.5 | 72 |

Table 1: Spatial Attention Guided Mask (SAG-Mask). These models use a ResNet-50 backbone. We note that the mask head with Eq. 1 is the same as the mask branch of Mask R-CNN. SAM and Mask scoring denote the proposed Spatial Attention Module and mask scoring [14].

| Backbone | Params. | APmask | APbox | Time (ms) |
|---|---|---|---|---|
| VoVNetV1-39 | 49.0M | 35.3 | 39.7 | 68 |
| + residual | 49.0M | 35.5 (+0.2) | 39.8 (+0.1) | 68 |
| + SE [12] | 50.8M | 34.6 (-0.7) | 39.0 (-0.7) | 70 |
| + eSE, ours | 52.6M | 35.6 (+0.3) | 40.0 (+0.3) | 70 |
| VoVNetV1-57 | 63.0M | 36.1 | 40.8 | 74 |
| + residual | 63.0M | 36.4 (+0.3) | 41.1 (+0.3) | 74 |
| + SE [12] | 65.9M | 35.9 (-0.2) | 40.8 | 77 |
| + eSE, ours | 68.9M | 36.6 (+0.5) | 41.5 (+0.7) | 76 |

Table 2: VoVNetV2. Starting from VoVNetV1, VoVNetV2 is improved by adding the residual connection [10] and the proposed effective SE (eSE).

| Backbone | Params. | APmask | APmask_S | APmask_M | APmask_L | APbox | APbox_S | APbox_M | APbox_L | Time (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| MobileNetV2 [29] | 28.7M | 29.5 | 12.0 | 31.4 | 43.8 | 32.6 | 17.8 | 35.2 | 43.2 | 56 |
| VoVNetV2-19 | 37.6M | 32.2 | 14.1 | 34.8 | 48.1 | 35.9 | 20.8 | 39.2 | 47.6 | 59 |
| HRNetV2-W18 [30] | 36.4M | 33.0 | 14.3 | 34.7 | 49.9 | 36.7 | 20.7 | 39.4 | 49.3 | 80 |
| ResNet-50 [10] | 51.2M | 34.4 | 14.8 | 37.4 | 51.4 | 38.5 | 21.7 | 42.4 | 51.0 | 72 |
| VoVNetV1-39 [17] | 49.0M | 35.3 | 15.5 | 38.4 | 52.1 | 39.7 | 23.0 | 43.3 | 52.7 | 68 |
| VoVNetV2-39 | 52.6M | 35.6 | 16.0 | 38.6 | 52.8 | 40.0 | 23.4 | 43.7 | 53.9 | 70 |
| HRNetV2-W32 [30] | 56.2M | 36.2 | 16.0 | 38.4 | 53 | 40.6 | 23.0 | 43.8 | 53.1 | 95 |
| ResNet-101 [10] | 70.1M | 36.0 | 16.5 | 39.2 | 54.4 | 40.7 | 23.4 | 44.3 | 54.7 | 91 |
| VoVNetV1-57 [17] | 63.0M | 36.1 | 16.2 | 39.2 | 54.0 | 40.8 | 23.7 | 44.2 | 55.3 | 74 |
| VoVNetV2-57 | 68.9M | 36.6 | 16.9 | 39.8 | 54.5 | 41.5 | 24.1 | 45.2 | 55.2 | 76 |
| ResNeXt-101 [33] | 114.3M | 38.3 | 18.4 | 41.6 | 55.4 | 43.1 | 26.1 | 46.8 | 55.7 | 157 |
| VoVNetV2-99 | 96.9M | 38.3 | 18.0 | 41.8 | 56.0 | 43.5 | 25.8 | 47.8 | 57.3 | 106 |

Table 3: CenterMask with other backbones on COCO val2017. Note that all models are trained in the same manner (e.g., 12 epochs, batch size of 16, without train & test augmentation). The inference time is reported on the same Titan Xp GPU.

| Feature Level | APmask | APbox |
|---|---|---|
| P3 ~ P7 | 34.4 | 38.8 |
| P3 ~ P6 | 34.6 | 38.8 |
| P3 ~ P5 | 34.6 | 38.9 |
| P3 ~ P4 | 34.4 | 38.5 |

Table 4: Feature level ranges for RoIAlign [9] in CenterMask. P3 ~ P7 denotes the feature maps with output strides of $2^3$ to $2^7$.

3 Experiments

In this section, we evaluate the effectiveness of CenterMask on COCO [21] benchmarks. All models are trained on train2017, and val2017 is used for ablation studies. Final results are reported on test-dev for comparison with state-of-the-art methods. We use APmask to denote mask average precision (AP averaged over IoU thresholds), and APS, APM, and APL to denote AP at different scales. We also denote box AP as APbox. All ablation studies are conducted using CenterMask with ResNet-50-FPN, except for the backbone experiment in Table 3. Unless specified, we report the inference time of models using one thread (batch size of 1) on the same workstation equipped with a Titan Xp GPU, CUDA v10.0, cuDNN v7.3, and PyTorch 1.1. Qualitative results of CenterMask are shown in Figure 4, and quantitative results follow.

3.1 Ablation study

Scale-adaptive RoI assignment function: We validate the proposed Equation 2 against Equation 1 in CenterMask. Table 1 shows that our scale-adaptive RoI assignment function, which considers the input scale, improves APmask by 0.2% over its counterpart. This means that Equation 2, which uses the ratio of input/RoI area, is more scale-adaptive than Equation 1. We note that since RoI assignment occurs after detecting boxes, APbox is unchanged.

We also ablate which feature level range is suitable for a one-stage detector such as CenterMask. Since the FCOS detector extracts features from P3 ~ P7, we start with the same feature levels in the SAG-Mask branch. As shown in Table 4, the performance of the P3 ~ P7 range is not as good as that of the other ranges. We speculate that the P7 feature map is too small to extract fine features for pixel-level prediction. We observe that the P3 ~ P5 feature range achieves the best result, which means that feature maps with a bigger resolution are advantageous for mask prediction.

| Method | Backbone | Epochs | APmask | APmask_S | APmask_M | APmask_L | APbox | APbox_S | APbox_M | APbox_L | Time (ms) | FPS | GPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mask R-CNN, ours | R-101-FPN | 24 | 37.9 | 18.1 | 40.3 | 53.3 | 42.2 | 24.9 | 45.2 | 52.7 | 94 | 10.6 | V100 |
| ShapeMask [15] | R-101-FPN | N/A | 37.4 | 16.1 | 40.1 | 53.8 | 42.0 | 24.3 | 45.2 | 53.1 | 125 | 8.0 | V100 |
| TensorMask [5] | R-101-FPN | 72 | 37.1 | 17.4 | 39.1 | 51.6 | - | - | - | - | 380 | 2.6 | V100 |
| RetinaMask [7] | R-101-FPN | 24 | 34.7 | 14.3 | 36.7 | 50.5 | 41.4 | 23.0 | 44.5 | 53.0 | 98 | 10.2 | V100 |
| CenterMask | R-101-FPN | 24 | 38.3 | 17.7 | 40.8 | 54.5 | 43.1 | 25.2 | 46.1 | 54.4 | 72 | 13.9 | V100 |
| YOLACT-400 [1] | R-101-FPN | 48 | 24.9 | 5.0 | 25.3 | 45.0 | 28.4 | 10.7 | 28.9 | 43.1 | 22 | 45.5 | Xp |
| CenterMask-Lite | M-v2-FPN | 24 | 25.2 | 8.6 | 25.8 | 38.2 | 28.8 | 14 | 30.7 | 37.8 | 20 | 50.0 | Xp |
| YOLACT-550 [1] | R-50-FPN | 48 | 28.2 | 9.2 | 29.3 | 44.8 | 30.3 | 14.0 | 31.2 | 43.0 | 23 | 43.5 | Xp |
| CenterMask-Lite | V-19-FPN | 24 | 28.9 | 11.2 | 30.0 | 43.1 | 32.1 | 17 | 34.4 | 41.5 | 23 | 43.5 | Xp |
| YOLACT-550 [1] | R-101-FPN | 48 | 29.8 | 9.9 | 31.3 | 47.7 | 31.0 | 14.4 | 31.8 | 43.7 | 30 | 33.3 | Xp |
| YOLACT-700 [1] | R-101-FPN | 48 | 31.2 | 12.1 | 33.3 | 47.1 | 33.7 | 16.8 | 35.6 | 45.7 | 42 | 23.8 | Xp |
| CenterMask-Lite | R-50-FPN | 24 | 31.9 | 12.4 | 33.8 | 47.3 | 35.3 | 18.2 | 38.6 | 46.2 | 29 | 34.5 | Xp |
| CenterMask-Lite | V-39-FPN | 24 | 33.4 | 13.4 | 35.2 | 49.5 | 38.0 | 20.3 | 40.9 | 49.8 | 28 | 35.7 | Xp |

Table 5: CenterMask instance segmentation and detection performance on COCO test-dev2017. Mask R-CNN, RetinaMask, and CenterMask are implemented on the same code base [25]. R, V, X, and M denote ResNet, VoVNetV2, ResNeXt, and MobileNetV2, respectively.

Spatial Attention Guided Mask: Table 1 demonstrates the influence of each component in building the Spatial Attention Guided Mask (SAG-Mask). The baseline, the FCOS object detector, starts from 38.1% APbox with a run time of 56 ms. Adding only a naive mask head improves the box performance by 0.6% APbox and obtains 33.8% APmask. On top of the aforementioned scale-adaptive RoI mapping strategy, our spatial attention module, SAM, further improves the mask performance, because it helps the mask predictor to focus on informative pixels and suppress noise. It can also be seen that the detection performance is boosted when using SAM. We conjecture that, as a result of the SAM, the refined feature maps of the mask head also have a secondary effect on the detection branch, which shares the feature maps of the backbone.

Since SAG-Mask has a similar structure to Mask R-CNN for mask prediction, it can also deploy mask scoring [14], which recalibrates the score according to the predicted mask IoU. As a result, mask scoring increases performance by 0.5% APmask. We note that mask scoring cannot boost detection performance, because the recalibrated mask score only adjusts the ranks of mask results in the evaluation step and does not refine the features of the mask head as the SAM does. Besides, SAM causes almost no extra computation, while mask scoring leads to computation overhead (e.g., +5 ms).
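The re-ranking effect of mask scoring can be illustrated with a toy example. Following Mask Scoring R-CNN [14], the sketch assumes that the final mask score is the classification score multiplied by the predicted mask IoU; the function name and the sample detections are made up for illustration.

```python
def rescore(detections):
    """detections: list of (name, cls_score, predicted_mask_iou).
    Returns (name, mask_score) pairs sorted from best to worst."""
    rescored = [(name, cls * iou) for name, cls, iou in detections]
    return sorted(rescored, key=lambda d: d[1], reverse=True)

dets = [
    ("a", 0.90, 0.50),  # confident class, poor predicted mask quality
    ("b", 0.80, 0.90),  # slightly less confident, good predicted mask quality
]
for name, score in rescore(dets):
    print(name, round(score, 2))
# b 0.72
# a 0.45
```

Only the ordering of the mask results changes, while the boxes and mask features stay untouched, which is consistent with the observation that mask scoring does not improve APbox.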

VoVNetV2: We extend VoVNet to VoVNetV2 by adding the residual connection and the proposed effective SE (eSE) module to VoVNet. Table 2 shows that the residual connection improves both VoVNet-39 and VoVNet-57. In particular, the AP improvement for VoVNet-57 is bigger than that for VoVNet-39 because VoVNet-57, which comprises more OSA modules, benefits more from the residual connection that alleviates the optimization problem.

To validate eSE, we also apply SE [12] to VoVNet and compare it with the proposed eSE. As shown in Table 2, SE worsens the performance of VoVNet or has no effect, because the diversified feature map of the OSA module loses channel information due to the channel dimension reduction in SE. In contrast to SE, our eSE, which maintains channel information using only one FC layer, boosts both APmask and APbox over VoVNetV1 with only slight extra computation.

Comparison to other backbones: We scale VoVNetV2 to various sizes: large (V-99), base (V-39/57), and lightweight (V-19), which correspond to ResNeXt-32x8d, ResNet-50/101 & HRNet-W18/W32, and MobileNetV2, respectively. Table 3 and Figure 1 demonstrate that VoVNetV2 is a well-balanced backbone network in terms of accuracy and speed. While VoVNetV1-39 already outperforms its counterparts, VoVNetV2-39 shows better APmask than ResNet-50/HRNet-W18 by a large margin of 1.2%/2.6%, respectively, at faster speeds. The gain in APbox is even bigger than that in APmask: 1.5%/3.3%, respectively. A similar pattern is shown by VoVNetV2-57 and its counterparts.

For the large model, VoVNetV2-99 shows a much faster run time (106 ms vs. 157 ms in Table 3) and achieves competitive APmask and higher APbox than ResNeXt-101-32x8d despite having fewer model parameters. For the small model, VoVNetV2-19 outperforms MobileNetV2 by a large margin of 2.7% APmask / 3.3% APbox (Table 3), with comparable speed.

Figure 4: Results of CenterMask with VoVNetV2-99 on COCO test-dev2017.

3.2 Comparison with state-of-the-art methods

For further validation of CenterMask, we compare it with state-of-the-art instance segmentation methods. As most methods [23, 5, 8, 1, 7] use train-time augmentation, we also adopt scale jittering, where the shorter image side is randomly sampled from [640, 800] pixels [8]. For CenterMask-Lite, [580, 600] scale jittering is used and, at test time, the shorter side is resized to 600 pixels. Although many methods [23, 27, 8, 35, 5, 1] show that a longer training schedule (e.g., over 48 epochs) boosts performance, we use a 24-epoch schedule (see Table 5) for efficient training time, which leaves room for further performance improvement. We note that we do not use test-time augmentation [8]. The other hyper-parameters are kept the same as in the ablation study. For a fair speed comparison, we run inference on the same GPU as the counterparts. Specifically, since most large models are tested on a V100 GPU and YOLACT [1] models are reported on a Titan Xp GPU, we also report large CenterMask models on V100 and CenterMask-Lite models on Xp.

Under the same ResNet-101 backbone, CenterMask outperforms all other counterparts in terms of both accuracy (APmask, APbox) and speed. In particular, compared to RetinaMask [7], which has a similar architecture (i.e., one-stage detector + mask branch), CenterMask achieves a gain of 3.6% APmask. With fewer than half the training epochs, CenterMask also surpasses the dense sliding-window method, TensorMask [5], by 1.2% APmask at a faster speed.

We also compare with YOLACT [1], the representative real-time instance segmentation method. We use four kinds of backbones (MobileNetV2, VoVNetV2-19, VoVNetV2-39, and ResNet-50), which offer different accuracy-speed tradeoffs. Table 5 and Figure 1 (top) demonstrate that CenterMask-Lite is superior to YOLACT in terms of accuracy and speed. All CenterMask-Lite models run at over 30 fps with a large margin in both APmask and APbox, while YOLACT also runs at over 30 fps except for YOLACT-700 with ResNet-101. We note that if we trained CenterMask as long as YOLACT, it could obtain further performance gains.
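The FPS figures in Table 5 follow directly from the per-image inference times, which a short check makes explicit. The helper names and the 30 fps real-time threshold (taken from the comparison above) are illustrative.

```python
def fps(ms_per_image):
    """Frames per second for a given per-image inference time in ms."""
    return 1000.0 / ms_per_image

def is_real_time(ms_per_image, threshold_fps=30.0):
    return fps(ms_per_image) > threshold_fps

# Inference times (ms) taken from Table 5.
table5 = [
    ("CenterMask R-101-FPN (V100)", 72),
    ("CenterMask-Lite R-50-FPN (Xp)", 29),
    ("CenterMask-Lite V-39-FPN (Xp)", 28),
    ("YOLACT-700 R-101-FPN (Xp)", 42),
]
for name, ms in table5:
    print(f"{name}: {fps(ms):.1f} fps, real-time: {is_real_time(ms)}")
```

For example, 28 ms per image corresponds to about 35.7 fps, while 42 ms corresponds to about 23.8 fps and falls below the 30 fps mark.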

4 Discussion

In Table 5, we observe that, using the same ResNet-101 backbone, Mask R-CNN shows better performance than CenterMask on APS. We conjecture that this is because Mask R-CNN uses larger feature maps (P2) than CenterMask (P3), from which the mask branch can extract a much finer spatial layout of an object than from the P3 feature map. We note that there is still room for improving one-stage instance segmentation performance, for example with techniques [2, 3] developed for Mask R-CNN.

5 Conclusion

We have proposed a real-time anchor-free one-stage instance segmentation method and more effective backbone networks. By adding a spatial attention-guided mask branch to an anchor-free one-stage object detector, CenterMask achieves state-of-the-art performance at real-time speed. The newly proposed VoVNetV2 backbone, spanning from lightweight to larger models, gives CenterMask well-balanced performance in terms of speed and accuracy. We hope CenterMask will serve as a baseline for real-time instance segmentation. We also believe our proposed VoVNetV2 can be used as a strong and efficient backbone network for various vision tasks.


  • [1] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee (2019-10) YOLACT: real-time instance segmentation. In

    The IEEE International Conference on Computer Vision (ICCV)

    Cited by: Figure 1, §1, §1, §3.2, §3.2, Table 5.
  • [2] Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In CVPR, pp. 6154–6162. Cited by: §1, §4.
  • [3] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al. (2019) Hybrid task cascade for instance segmentation. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    pp. 4974–4983. Cited by: §1, §4.
  • [4] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua (2017) Sca-cnn: spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5659–5667. Cited by: §2.4.
  • [5] X. Chen, R. Girshick, K. He, and P. Dollar (2019-10) TensorMask: a foundation for dense object segmentation. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Figure 1, §3.2, §3.2, Table 5.
  • [6] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019-10) CenterNet: keypoint triplets for object detection. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1.
  • [7] C. Fu, M. Shvets, and A. C. Berg (2019) RetinaMask: learning to predict masks improves state-of-the-art single-shot detection for free. arXiv preprint arXiv:1901.03353. Cited by: Figure 1, §3.2, §3.2, Table 5.
  • [8] K. He, R. Girshick, and P. Dollar (2019-10) Rethinking imagenet pre-training. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §3.2.
  • [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, Cited by: Figure 1, §1, §1, §2.3, §2.4, §2.6, Table 4.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1, §2.5, §2.5, Table 2, Table 3.
  • [11] J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi (2018) Gather-excite: exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 9401–9411. Cited by: §2.4.
  • [12] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §1, §2.4, §2.5, Table 2, §3.1.
  • [13] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, Cited by: §1.
  • [14] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang (2019) Mask scoring r-cnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6409–6418. Cited by: §1, §2.6, Table 2, §3.1.
  • [15] W. Kuo, A. Angelova, J. Malik, and T. Lin (2019-10) ShapeMask: learning to segment novel objects by refining shape priors. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Figure 1, Table 5.
  • [16] H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In ECCV, pp. 734–750. Cited by: §1.
  • [17] Y. Lee, J. Hwang, S. Lee, Y. Bae, and J. Park (2019) An energy and gpu-computation efficient backbone network for real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0. Cited by: §1, §2.5, Table 3.
  • [18] Y. Li, Y. Chen, N. Wang, and Z. Zhang (2019) Scale-aware trident networks for object detection. arXiv preprint arXiv:1901.01892. Cited by: §1.
  • [19] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection. In CVPR, Cited by: §1, §2.3, §2.3, §2.6, Table 2.
  • [20] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, Cited by: §1, §2.1, §2.3.
  • [21] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: §1, §1, §3.
  • [22] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia (2018) Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768. Cited by: §1.
  • [23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In ECCV, Cited by: §3.2.
  • [24] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §2.1.
  • [25] F. Massa and R. Girshick (2018) maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. Cited by: Table 5.
  • [26] Z. Qin, Z. Li, Z. Zhang, Y. Bao, G. Yu, Y. Peng, and J. Sun (2019-10) ThunderNet: towards real-time generic object detection on mobile devices. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.4.
  • [27] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In CVPR, Cited by: §2.1, §3.2.
  • [28] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NIPS, Cited by: §1, §2.1.
  • [29] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: Table 3.
  • [30] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang (2019) High-resolution representations for labeling pixels and regions. arXiv preprint arXiv:1904.04514. Cited by: Table 3.
  • [31] Z. Tian, C. Shen, H. Chen, and T. He (2019-10) FCOS: fully convolutional one-stage object detection. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §1, §2.3, §2.6, §2.6.
  • [32] S. Woo, J. Park, J. Lee, and I. So Kweon (2018) Cbam: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §2.4.
  • [33] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In CVPR, Cited by: Table 3.
  • [34] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §1.
  • [35] X. Zhou, J. Zhuo, and P. Krahenbuhl (2019) Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 850–859. Cited by: §1, §3.2.
  • [36] X. Zhu, D. Cheng, Z. Zhang, S. Lin, and J. Dai (2019-10) An empirical study of spatial attention mechanisms in deep networks. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.4.