Hybrid Task Cascade for Instance Segmentation

by Kai Chen, et al.

Cascade is a classic yet powerful architecture that has boosted performance on various tasks. However, how to introduce cascade to instance segmentation remains an open question. A simple combination of Cascade R-CNN and Mask R-CNN only brings limited gain. In exploring a more effective approach, we find that the key to a successful instance segmentation cascade is to fully leverage the reciprocal relationship between detection and segmentation. In this work, we propose a new framework, Hybrid Task Cascade (HTC), which differs in two important aspects: (1) instead of performing cascaded refinement on these two tasks separately, it interweaves them for a joint multi-stage processing; (2) it adopts a fully convolutional branch to provide spatial context, which can help distinguish hard foreground from cluttered background. Overall, this framework can learn more discriminative features progressively while integrating complementary features together in each stage. Without bells and whistles, a single HTC obtains 38.4% mask AP, a 1.5% improvement over a strong Cascade Mask R-CNN baseline, on the MSCOCO dataset. More importantly, our overall system achieves 48.6 mask AP on the test-challenge dataset and 49.0 mask AP on test-dev, which are state-of-the-art results.


1 Introduction

Instance segmentation is a fundamental computer vision task that performs per-pixel labeling of objects at instance level. Achieving accurate and robust instance segmentation in real-world scenarios such as autonomous driving and video surveillance is challenging. Firstly, visual objects are often subject to deformation, occlusion and scale changes. Secondly, background clutter makes object instances hard to isolate. To tackle these issues, we need a robust representation that is resilient to appearance variations. At the same time, it needs to capture rich contextual information for discriminating objects from cluttered background.

Cascade is a classic yet powerful architecture that has boosted performance on various tasks by multi-stage refinement. Cascade R-CNN [5] presented a multi-stage architecture for object detection and achieved promising results. The success of Cascade R-CNN can be ascribed to two key aspects: (1) progressive refinement of predictions and (2) adaptive handling of training distributions.

Although cascade is effective on detection tasks, integrating it into instance segmentation is nontrivial. A direct combination of Cascade R-CNN and Mask R-CNN [18] only brings limited gain in terms of mask AP compared to bbox AP. Specifically, it improves bbox AP by 3.6% (39.1 to 42.7) but mask AP by only 1.3% (35.6 to 36.9), as shown in Table 1. An important reason for this large gap is the suboptimal information flow among mask branches of different stages. Mask branches in later stages only benefit from better localized bounding boxes, without direct connections.

To bridge this gap, we propose Hybrid Task Cascade (HTC), a new cascade architecture for instance segmentation. The key idea is to improve the information flow by incorporating cascade and multi-tasking at each stage and leverage spatial context to further boost the accuracy. Specifically, we design a cascaded pipeline for progressive refinement. At each stage, both bounding box regression and mask prediction are combined in a multi-tasking manner. Moreover, direct connections are introduced between the mask branches at different stages – the mask features of each stage will be embedded and fed to the next one, as demonstrated in Figure 2. The overall design strengthens the information flow between tasks and across stages, leading to better refinement at each stage and more accurate predictions on all tasks.

For object detection, the scene context also provides useful clues, e.g. for inferring the categories, scales, etc. To leverage this context, we incorporate a fully convolutional branch that performs pixel-level stuff segmentation. This branch encodes contextual information, not only from foreground instances but also from background regions, thus complementing the bounding boxes and instance masks. Our study shows that the use of the spatial contexts helps to learn more discriminative features.

Hybrid Task Cascade (HTC) is easy to implement and can be trained end-to-end. Without bells and whistles, it achieves 2.8% and 1.5% higher mask AP than the Mask R-CNN and Cascade Mask R-CNN baselines respectively on the challenging COCO dataset (Table 1, ResNet-50-FPN). Together with better backbones and other common components, e.g. deformable convolution, multi-scale training and testing, model ensembling, we achieve 49.0 mask AP on test-dev dataset, which is 2.3% higher than the winning approach [27] of COCO Challenge 2017.

Our main contributions are summarized as follows: (1) We propose Hybrid Task Cascade (HTC), which effectively integrates cascade into instance segmentation by interweaving detection and segmentation features together for a joint multi-stage processing. It achieves the state-of-the-art performance on COCO test-dev and test-challenge. (2) We demonstrate that spatial contexts benefit instance segmentation by discriminating foreground objects from background clutter. (3) We perform an extensive study on various components and designs, which provides a reference and is helpful for further research on object detection.

2 Related Work

Instance segmentation.

Instance segmentation is a task to localize objects of interest in an image at the pixel-level, where segmented objects are generally represented by masks. This task is closely related to both object detection and semantic segmentation. Hence, existing methods for this task roughly fall into two categories, namely detection-based and segmentation-based.

Detection-based methods resort to a conventional detector to generate bounding boxes or region proposals, and then predict the object masks within the bounding boxes. Many of these methods are based on CNN, including DeepMask [33], SharpMask [34], and InstanceFCN [10]. MNC [11] formulates instance segmentation as a pipeline that consists of three sub-tasks: instance localization, mask prediction and object categorization, and trains the whole network end-to-end in a cascaded manner. In a recent work, FCIS [22] extends InstanceFCN and presents a fully convolutional approach for instance segmentation. Mask R-CNN [18] adds an extra branch based on Faster R-CNN [36] to obtain pixel-level mask predictions, which shows that a simple pipeline can yield promising results. PANet [27] adds a bottom-up path besides the top-down path in FPN [23] to facilitate the information flow. MaskLab [7] produces instance-aware masks by combining semantic and direction predictions.

Segmentation-based methods, on the contrary, first obtain a pixel-level segmentation map over the image, and then identify object instances therefrom. Along this line, Zhang et al. [43, 42] propose to predict instance labels based on local patches and integrate the local results with an MRF. Arnab and Torr [1] also use CRF to identify instances. Bai and Urtasun [2] propose an alternative way, which combines watershed transform and deep learning to produce an energy map, and then derive the instances by dividing the output of the watershed transform. Other approaches include bridging category-level and instance-level segmentation [39], learning a boundary-aware mask representation [17], and employing a sequence of neural networks to deal with different sub-grouping problems.

Multi-stage object detection.

The past several years have seen remarkable progress in object detection. Mainstream object detection frameworks are often categorized into two types, single-stage, e.g., SSD [28], YOLO [35], RetinaNet [24] and two-stage, e.g., Faster R-CNN [36], R-FCN [12], Mask R-CNN [18].

Recently, detection frameworks with multiple stages have emerged as an increasingly popular paradigm for object detection. Multi-region CNN [14] incorporates an iterative localization mechanism that alternates between box scoring and location refinement. AttractioNet [15] introduces an Attend & Refine module to update bounding box locations iteratively. CRAFT [41] incorporates a cascade structure into RPN [36] and Fast R-CNN [16] to improve the quality of the proposal and detection results. IoU-Net [20] performs progressive bounding box refinement (even though it does not present a cascade structure explicitly). Cascade structures are also used to exclude easy negative samples. For example, CC-Net [29] rejects easy RoIs at shallow layers. Li et al. [21] propose to operate at multiple resolutions to reject simple samples. Among all the works that use cascade structures, Cascade R-CNN [5] is perhaps the most relevant to ours. Cascade R-CNN comprises multiple stages, where the output of each stage is fed into the next one for higher quality refinement. Moreover, the training data of each stage is sampled with increasing IoU thresholds, which inherently handles different training distributions.

While the proposed framework also adopts a cascade structure, it differs in several important aspects. First, multiple tasks, including detection, mask prediction, and semantic segmentation, are combined at each stage, thus forming a joint multi-stage processing pipeline. In this way, the refinement at each stage benefits from the reciprocal relations among these tasks. Moreover, contextual information is leveraged through an additional branch for stuff segmentation, and a direct path is added to allow direct information flow across stages.

3 Hybrid Task Cascade

Figure 1: The architecture evolution from Cascade Mask R-CNN to Hybrid Task Cascade. Panels: (a) Cascade Mask R-CNN; (b) interleaved execution; (c) mask information flow; (d) Hybrid Task Cascade (semantic feature fusion with box branches is not shown in the figure for neat presentation).

Cascade demonstrated its effectiveness on various tasks such as object detection [5]. However, it is non-trivial to design a successful architecture for instance segmentation. In this work, we find that the key to a successful instance segmentation cascade is to fully leverage the reciprocal relationship between detection and segmentation.

Overview. In this work, we propose Hybrid Task Cascade (HTC), a new framework of instance segmentation. Compared to existing frameworks, it is distinctive in several aspects: (1) It interleaves bounding box regression and mask prediction instead of executing them in parallel. (2) It incorporates a direct path to reinforce the information flow between mask branches by feeding the mask features of the preceding stage to the current one. (3) It aims to explore more contextual information by adding an additional semantic segmentation branch and fusing it with box and mask branches. Overall, these changes to the framework architecture effectively improve the information flow, not only across stages but also between tasks.

3.1 Multi-task Cascade

Figure 2: Architecture of multi-stage mask branches.

Cascade Mask R-CNN.

We begin with a direct combination of Mask R-CNN and Cascade R-CNN, denoted as Cascade Mask R-CNN. Specifically, a mask branch following the architecture of Mask R-CNN is added to each stage of Cascade R-CNN, as shown in Figure 1(a). The pipeline is formulated as:

$$
\begin{aligned}
x_t^{box} &= \mathcal{P}(x, r_{t-1}), & r_t &= B_t(x_t^{box}), \\
x_t^{mask} &= \mathcal{P}(x, r_{t-1}), & m_t &= M_t(x_t^{mask}).
\end{aligned}
$$

Here, $x$ indicates the CNN features of the backbone network, and $x_t^{box}$ and $x_t^{mask}$ indicate box and mask features derived from $x$ and the input RoIs. $\mathcal{P}(\cdot)$ is a pooling operator, e.g., RoI Align or RoI pooling, $B_t$ and $M_t$ denote the box and mask head at the $t$-th stage, and $r_t$ and $m_t$ represent the corresponding box predictions and mask predictions. By combining the advantages of cascaded refinement and the mutual benefits between bounding box and mask predictions, this design improves the box AP compared to Mask R-CNN and Cascade R-CNN alone. However, the mask prediction performance remains unsatisfying.

Interleaved Execution.

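One drawback of the above design is that the two branches at each stage are executed in parallel, both taking the bounding box predictions from the preceding stage as input. Consequently, the two branches do not directly interact within a stage. In response to this issue, we explore an improved design, which interleaves the box and mask branches, as illustrated in Figure 1(b). The interleaved execution can be expressed as:

$$
\begin{aligned}
x_t^{box} &= \mathcal{P}(x, r_{t-1}), & r_t &= B_t(x_t^{box}), \\
x_t^{mask} &= \mathcal{P}(x, r_t), & m_t &= M_t(x_t^{mask}).
\end{aligned}
$$

The only change from the parallel pipeline is that the mask features $x_t^{mask}$ are pooled with the boxes $r_t$ just predicted at stage $t$, rather than with the boxes $r_{t-1}$ of the preceding stage. In this way, the mask branch can take advantage of the updated bounding box predictions. We found that this yields improved performance.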
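The difference between parallel and interleaved execution can be made concrete with a toy sketch. Everything here is a hypothetical stand-in rather than the released implementation: boxes are 1-D intervals, `pool` is a plain slice instead of RoI Align, and the "heads" are trivial functions chosen so that the data flow is easy to trace.

```python
from typing import List, Tuple

Box = Tuple[int, int]  # toy 1-D box: (left, right), right-exclusive


def pool(features: List[int], box: Box) -> List[int]:
    """Stand-in for RoI pooling: slice the feature cells covered by the box."""
    return features[box[0]:box[1]]


def shrink_box_head(pooled: List[int], box: Box) -> Box:
    """Toy box head: tighten the box by one cell on each side."""
    return (box[0] + 1, box[1] - 1)


def threshold_mask_head(pooled: List[int]) -> List[int]:
    """Toy mask head: foreground wherever the pooled feature is positive."""
    return [1 if v > 0 else 0 for v in pooled]


def run_cascade(features, proposal, num_stages, interleaved):
    """Return, per stage, (refined box, mask, box used to pool mask features)."""
    outputs = []
    prev_box = proposal
    for _ in range(num_stages):
        new_box = shrink_box_head(pool(features, prev_box), prev_box)
        # Parallel: the mask branch pools with the preceding stage's box.
        # Interleaved: it pools with the box just refined at this stage.
        mask_box = new_box if interleaved else prev_box
        mask = threshold_mask_head(pool(features, mask_box))
        outputs.append((new_box, mask, mask_box))
        prev_box = new_box
    return outputs


features = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
parallel = run_cascade(features, (0, 10), num_stages=3, interleaved=False)
interleaved = run_cascade(features, (0, 10), num_stages=3, interleaved=True)
print(parallel[2][2], interleaved[2][2])
```

In the toy run, the last-stage mask of the interleaved variant is predicted inside the tightest box, while the parallel variant always lags one refinement step behind.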

Mask Information Flow.

In the design above, the mask prediction at each stage is based purely on the RoI features $x$ and the box prediction $r_t$. There is no direct information flow between mask branches at different stages, which prevents further improvements on mask prediction accuracy. Towards a good design of mask information flow, we first recall the design of the cascaded box branches in Cascade R-CNN [5]. An important point is that the input feature of a box branch is jointly determined by the output of the preceding stage and the backbone. Following similar principles, we introduce an information flow between mask branches by feeding the mask features of the preceding stage to the current stage, as illustrated in Figure 1(c). With the direct path between mask branches, the pipeline can be written as:

$$
\begin{aligned}
x_t^{box} &= \mathcal{P}(x, r_{t-1}), & r_t &= B_t(x_t^{box}), \\
x_t^{mask} &= \mathcal{P}(x, r_t), & m_t &= M_t\big(\mathcal{F}(x_t^{mask}, m_{t-1}^{-})\big),
\end{aligned}
$$

where $m_{t-1}^{-}$ denotes the intermediate feature of the mask head $M_{t-1}$ and we use it as the mask representation of stage $t-1$. $\mathcal{F}$ is a function to combine the features of the current stage and the preceding one. This information flow enables progressive refinement of the masks themselves, instead of merely predicting masks on progressively refined bounding boxes.


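Following the discussion above, we propose a simple implementation as below.

$$
\mathcal{F}(x_t^{mask}, m_{t-1}^{-}) = x_t^{mask} + \mathcal{G}_t(m_{t-1}^{-})
$$

In this implementation, we adopt the RoI feature before the deconvolutional layer as the mask representation $m_{t-1}^{-}$, whose spatial size is $14 \times 14$. At stage $t$, we need to forward all preceding mask heads to compute $m_{t-1}^{-}$.

$$
\begin{aligned}
m_1^{-} &= M_1^{-}(x_t^{mask}), \\
m_2^{-} &= M_2^{-}\big(\mathcal{F}(x_t^{mask}, m_1^{-})\big), \\
&\;\;\vdots \\
m_{t-1}^{-} &= M_{t-1}^{-}\big(\mathcal{F}(x_t^{mask}, m_{t-2}^{-})\big).
\end{aligned}
$$

Here, $M_t^{-}$ denotes the feature transformation component of the mask head $M_t$, which is comprised of 4 consecutive $3 \times 3$ convolutional layers, as shown in Figure 2. The transformed features $m_{t-1}^{-}$ are then embedded with a $1 \times 1$ convolutional layer $\mathcal{G}_t$ in order to be aligned with the pooled backbone features $x_t^{mask}$. Finally, $\mathcal{G}_t(m_{t-1}^{-})$ is added to $x_t^{mask}$ through element-wise sum. With this introduced bridge, adjacent mask branches are brought into direct interaction. Mask features in different stages are no longer isolated and all get supervised through backpropagation.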
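The recursion behind the mask information flow can be sketched with toy vectors in place of feature maps. This is a minimal illustration under stated assumptions, not the authors' code: `transforms` stands in for the convolutional part of each mask head, `embeds` for the embedding layer of each stage, and both are replaced by simple scaling functions.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def fuse(x_mask: Vector, prev_feat: Optional[Vector], embed: Layer) -> Vector:
    """Combine current RoI features with the embedded feature of the
    preceding stage by element-wise sum; stage 1 has no predecessor."""
    if prev_feat is None:
        return list(x_mask)
    return [a + b for a, b in zip(x_mask, embed(prev_feat))]


def mask_feature_at_stage(x_mask: Vector, transforms: Sequence[Layer],
                          embeds: Sequence[Layer], stage: int) -> Vector:
    """Intermediate mask feature of `stage` (1-based), obtained by forwarding
    mask heads 1..stage in order on the *current* RoI features."""
    feat: Optional[Vector] = None
    for t in range(stage):
        feat = transforms[t](fuse(x_mask, feat, embeds[t]))
    assert feat is not None
    return feat


def double(v: Vector) -> Vector:   # hypothetical feature transformation
    return [2.0 * e for e in v]


def halve(v: Vector) -> Vector:    # hypothetical embedding layer
    return [0.5 * e for e in v]


x = [1.0, 2.0]
feats = [mask_feature_at_stage(x, [double] * 3, [halve] * 3, s) for s in (1, 2, 3)]
print(feats)
```

Note that every stage reuses the same pooled features `x`, which is why all preceding heads have to be re-run whenever the RoIs change.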

3.2 Spatial Contexts from Segmentation

Figure 3: We introduce complementary contextual information by adding semantic segmentation branch.

To further help distinguish the foreground from the cluttered background, we use the spatial contexts as an effective cue. We add an additional branch to predict per-pixel semantic segmentation for the whole image, which adopts the fully convolutional architecture and is jointly trained with other branches, as shown in Figure 1(d). The semantic segmentation feature is a strong complement to existing box and mask features, thus we combine them together for better predictions:

$$
\begin{aligned}
x_t^{box} &= \mathcal{P}(x, r_{t-1}) + \mathcal{P}(S(x), r_{t-1}), & r_t &= B_t(x_t^{box}), \\
x_t^{mask} &= \mathcal{P}(x, r_t) + \mathcal{P}(S(x), r_t), & m_t &= M_t\big(\mathcal{F}(x_t^{mask}, m_{t-1}^{-})\big),
\end{aligned}
$$

where $S$ indicates the semantic segmentation head. In the above formulation, the box and mask heads of each stage take not only the RoI features extracted from the backbone as input, but also exploit semantic features, which can be more discriminative on cluttered background.

Semantic Segmentation Branch.

Specifically, the semantic segmentation branch is constructed based on the output of the Feature Pyramid [23]. Note that for semantic segmentation, the features at a single level may not be able to provide enough discriminative power. Hence, our design incorporates the features at multiple levels. In addition to the mid-level features, we also incorporate higher-level features with global information and lower-level features with local information for better feature representation.

Figure 3 shows the architecture of this branch. Each level of the feature pyramid is first aligned to a common representation space via a $1 \times 1$ convolutional layer. Then low level feature maps are upsampled, and high level feature maps are downsampled to the same spatial scale, where the stride is set to 8. We found empirically that this setting is sufficient for fine pixel-level predictions on the whole image. These transformed feature maps from different levels are subsequently fused by element-wise sum. Moreover, we add four convolutional layers thereon to further bridge the semantic gap. At the end, we simply adopt a convolutional layer to predict the pixel-wise segmentation map. Overall, we try to keep the design of the semantic segmentation branch simple and straightforward. Though a more delicate structure can further improve the performance, it goes beyond our scope and we leave it for future work.

Fusing Contexts Feature into Main Framework. It is well known that joint training of closely related tasks can improve feature representation and bring performance gains to original tasks. Here, we propose to fuse the semantic features with box/mask features to allow more interaction between different branches. In this way, the semantic branch directly contributes to the prediction of bounding boxes and masks with the encoded spatial contexts. Following the standard practice, given a RoI, we use RoIAlign to extract a small (e.g., $7 \times 7$ or $14 \times 14$) feature patch from the corresponding level of feature pyramid outputs as the representation. At the same time, we also apply RoIAlign on the feature map of the semantic branch and obtain a feature patch of the same shape, and then combine the features from both branches by element-wise sum.
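A minimal sketch of the fusion step, assuming for simplicity that the backbone map and the semantic map share the same resolution and that RoI pooling is an integer crop (a stand-in for RoIAlign, which interpolates and handles different strides). The helper names are hypothetical.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]
RoI = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


def crop(feature_map: FeatureMap, roi: RoI) -> FeatureMap:
    """Stand-in for RoIAlign: take the cells inside the RoI without interpolation."""
    x0, y0, x1, y1 = roi
    return [row[x0:x1] for row in feature_map[y0:y1]]


def fuse_roi_features(backbone_map: FeatureMap, semantic_map: FeatureMap,
                      roi: RoI) -> FeatureMap:
    """Pool the same RoI from both maps and combine them by element-wise sum."""
    a = crop(backbone_map, roi)
    b = crop(semantic_map, roi)
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("pooled patches must have the same shape")
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


backbone = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
semantic = [[10.0] * 4 for _ in range(4)]
fused = fuse_roi_features(backbone, semantic, (1, 1, 3, 3))
print(fused)
```

Because the fusion is a plain sum, the semantic patch must be pooled to exactly the same shape as the backbone patch before the heads consume it.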

3.3 Learning

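Since all the modules described above are differentiable, Hybrid Task Cascade (HTC) can be trained in an end-to-end manner. At each stage $t$, the box head predicts the classification score $c_t$ and regression offset $r_t$ for all sampled RoIs. The mask head predicts pixel-wise masks $m_t$ for positive RoIs. The semantic branch predicts a full image semantic segmentation map $s$. The overall loss function takes the form of a multi-task learning:

$$
\begin{aligned}
L &= \sum_{t=1}^{T} \alpha_t \left( L_{bbox}^{t} + L_{mask}^{t} \right) + \beta L_{seg}, \\
L_{bbox}^{t}(c_t, r_t, \hat{c}_t, \hat{r}_t) &= L_{cls}(c_t, \hat{c}_t) + L_{reg}(r_t, \hat{r}_t), \\
L_{mask}^{t}(m_t, \hat{m}_t) &= \mathrm{BCE}(m_t, \hat{m}_t), \\
L_{seg} &= \mathrm{CE}(s, \hat{s}).
\end{aligned}
$$

Here, $L_{bbox}^{t}$ is the loss of the bounding box predictions at stage $t$, which follows the same definition as in Cascade R-CNN [5] and combines two terms $L_{cls}$ and $L_{reg}$, respectively for classification and bounding box regression. $L_{mask}^{t}$ is the loss of mask prediction at stage $t$, which adopts the binary cross entropy form as in Mask R-CNN [18]. $L_{seg}$ is the semantic segmentation loss in the form of cross entropy. Hatted symbols denote the corresponding ground-truth targets. The coefficients $\alpha_t$ and $\beta$ are used to balance the contributions of different stages and tasks. We follow the hyperparameter settings in Cascade R-CNN [5]. Unless otherwise noted, we set $\alpha = [1, 0.5, 0.25]$, $T = 3$ and $\beta = 1$ by default.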
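The weighted sum of stage-wise and semantic losses can be sketched as follows. The function name and argument layout are hypothetical, and the default stage weights (1, 0.5, 0.25) are our reading of the Cascade R-CNN setting that the text says is followed; treat them as an assumption to adjust.

```python
from typing import Sequence


def htc_total_loss(bbox_losses: Sequence[float],
                   mask_losses: Sequence[float],
                   seg_loss: float,
                   stage_weights: Sequence[float] = (1.0, 0.5, 0.25),  # assumed defaults
                   seg_weight: float = 1.0) -> float:
    """Weighted sum of per-stage (box + mask) losses plus the semantic loss."""
    if not (len(bbox_losses) == len(mask_losses) == len(stage_weights)):
        raise ValueError("need one box loss, mask loss and weight per stage")
    stage_term = sum(w * (lb + lm)
                     for w, lb, lm in zip(stage_weights, bbox_losses, mask_losses))
    return stage_term + seg_weight * seg_loss


# Three stages with unit box and mask losses and a semantic loss of 0.5:
total = htc_total_loss([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], seg_loss=0.5)
print(total)  # 1*2 + 0.5*2 + 0.25*2 + 0.5 = 4.0
```

Down-weighting later stages keeps the early, easier stages from being dominated by the losses of stages trained on higher-quality samples.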

4 Experiments

Method              Backbone          box AP  mask AP  AP50  AP75  APS   APM   APL   runtime (fps)
Mask R-CNN [18]     ResNet-50-FPN     39.1    35.6     57.6  38.1  18.7  38.3  46.6  5.3
PANet [27]          ResNet-50-FPN     41.2    36.6     58.0  39.3  16.3  38.1  52.4  -
Cascade Mask R-CNN  ResNet-50-FPN     42.7    36.9     58.6  39.7  19.6  39.3  48.8  3.0
Cascade Mask R-CNN  ResNet-101-FPN    44.4    38.4     60.2  41.4  20.2  41.0  50.6  2.9
Cascade Mask R-CNN  ResNeXt-101-FPN   46.6    40.1     62.7  43.4  22.0  42.8  52.9  2.5
HTC (ours)          ResNet-50-FPN     43.6    38.4     60.0  41.5  20.4  40.7  51.2  2.5
HTC (ours)          ResNet-101-FPN    45.3    39.7     61.8  43.1  21.0  42.2  53.5  2.4
HTC (ours)          ResNeXt-101-FPN   47.1    41.2     63.9  44.7  22.8  43.9  54.6  2.1
Table 1: Comparison with state-of-the-art methods on COCO test-dev dataset. AP50 through APL are mask AP metrics.

4.1 Datasets and Evaluation Metrics

Datasets. We perform experiments on the challenging instance segmentation dataset: COCO dataset [25]. We train our models on the split of 2017train (115k images) and report results on 2017val and 2017test-dev. Typical instance annotations are used to supervise box and mask branches, and the semantic branch is supervised by COCO-stuff [4] annotations.

Evaluation Metrics. We report the standard COCO-style Average Precision (AP) metrics, which average APs across IoU thresholds from 0.5 to 0.95 with an interval of 0.05. Both box AP and mask AP are evaluated. For mask AP, we also report $AP_{50}$, $AP_{75}$ (AP at different IoU thresholds) and $AP_S$, $AP_M$, $AP_L$ (AP at different scales). Runtime is measured on a single TITAN Xp GPU.
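To make the metric definition concrete, the sketch below lists the ten IoU thresholds, computes the IoU of two binary masks, and averages per-threshold APs. It deliberately omits the precision-recall computation that produces each per-threshold AP, and the function names are illustrative rather than taken from the COCO toolkit.

```python
from typing import Dict, List, Sequence


def coco_iou_thresholds() -> List[float]:
    """IoU thresholds 0.50, 0.55, ..., 0.95 used by COCO-style AP."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mask_iou(pred: Sequence[int], gt: Sequence[int]) -> float:
    """IoU of two flattened binary masks of equal length."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 0.0


def coco_style_ap(ap_per_threshold: Dict[float, float]) -> float:
    """Average the AP values measured at each of the ten IoU thresholds."""
    thresholds = coco_iou_thresholds()
    return sum(ap_per_threshold[t] for t in thresholds) / len(thresholds)


thresholds = coco_iou_thresholds()
iou = mask_iou([1, 1, 0, 0], [1, 0, 1, 0])
# Hypothetical per-threshold APs that fall linearly from 0.6 to 0.15.
toy_aps = {t: 0.6 - 0.05 * i for i, t in enumerate(thresholds)}
print(thresholds, iou, coco_style_ap(toy_aps))
```

A prediction counts as a true positive at a given threshold only if its IoU with a ground-truth mask reaches that threshold, so stricter thresholds reward better-aligned masks.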

4.2 Implementation Details

In all experiments, we adopt a 3-stage cascade and FPN is used in all backbones. For fair comparison, Mask R-CNN and Cascade R-CNN are reimplemented with PyTorch [30] and mmdetection [6]; these reimplementations score slightly higher than the performance reported in the original papers. We train detectors with 16 GPUs (one image per GPU) for 20 epochs with an initial learning rate of 0.02, and decrease it by a factor of 0.1 after 16 and 19 epochs, respectively. The long edge and short edge of images are resized to 1333 and 800 respectively without changing the aspect ratio.
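A step learning-rate schedule of this kind can be written in a few lines. The default values below (base rate 0.02, decay by 0.1 after epochs 16 and 19) are assumptions about the training setup and should be replaced with whatever configuration is actually used; the function name is hypothetical.

```python
from typing import Sequence


def step_lr(epoch: int,
            base_lr: float = 0.02,                 # assumed initial rate
            milestones: Sequence[int] = (16, 19),  # assumed decay epochs
            gamma: float = 0.1) -> float:
    """Learning rate for a 0-based epoch index under step decay:
    multiply by `gamma` once for every milestone already completed."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * (gamma ** passed)


schedule = [step_lr(e) for e in range(20)]
print(schedule[0], schedule[16], schedule[19])
```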

During inference, object proposals are refined progressively by box heads of different stages. Classification scores of multiple stages are ensembled as in Cascade R-CNN. Mask branches are only applied to detection boxes with higher scores than a threshold and are also ensembled.
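The inference-time ensembling can be sketched as an average of the per-stage classification scores followed by a score filter before the mask heads run. This is a minimal sketch: averaging is the usual way Cascade R-CNN combines stage classifiers, while the threshold value and the dictionary layout of a detection are hypothetical.

```python
from typing import Dict, List, Sequence


def ensemble_scores(stage_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average class scores of one RoI over all cascade stages."""
    num_stages = len(stage_scores)
    return [sum(per_class) / num_stages for per_class in zip(*stage_scores)]


def boxes_for_mask_heads(detections: List[Dict], score_thr: float) -> List[Dict]:
    """Keep only detections whose score exceeds the threshold, so mask
    branches are not run on low-confidence boxes."""
    return [d for d in detections if d["score"] > score_thr]


# Scores of one RoI over two classes, as predicted by three stages.
avg = ensemble_scores([[0.6, 0.4], [0.8, 0.2], [0.7, 0.3]])
kept = boxes_for_mask_heads(
    [{"box": (0, 0, 10, 10), "score": 0.9}, {"box": (5, 5, 8, 8), "score": 0.01}],
    score_thr=0.05,  # hypothetical threshold; the text does not state the value
)
print(avg, len(kept))
```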

4.3 Benchmarking Results

We compare Hybrid Task Cascade with the state-of-the-art instance segmentation approaches on the COCO dataset in Table 1. We also evaluate Cascade Mask R-CNN as a strong baseline of our method, which is described in Section 3.1. Compared to Mask R-CNN, the naive cascaded baseline brings 3.6% and 1.3% gain in terms of box AP and mask AP, respectively. It is noted that this baseline is already higher than PANet [27], the state-of-the-art instance segmentation method. Our HTC achieves consistent improvements on different backbones, proving its effectiveness. It achieves a gain of 1.5%, 1.3% and 1.1% mask AP over Cascade Mask R-CNN for ResNet-50-FPN, ResNet-101-FPN and ResNeXt-101-FPN, respectively.

4.4 Ablation Study

Component-wise Analysis. Firstly, we investigate the effects of the main components in our framework. “Interleaved” denotes the interleaved execution of bbox and mask branches, “Mask Info” indicates the introduction of mask branch information flow and “Semantic” means adding the semantic segmentation branch. From Table 2, we can learn that the interleaved execution slightly improves the mask AP from 36.5 to 36.7. The mask information flow contributes a further improvement to 37.4, and the semantic segmentation branch raises it to 38.0.

Cascade  Interleaved  Mask Info  Semantic  box AP  mask AP  AP50  AP75  APS   APM   APL
Y        -            -          -         42.5    36.5     57.9  39.4  18.9  39.5  50.8
Y        Y            -          -         42.5    36.7     57.7  39.4  18.9  39.7  50.8
Y        Y            Y          -         42.5    37.4     58.1  40.3  19.6  40.3  51.5
Y        Y            Y          Y         43.2    38.0     59.4  40.7  20.3  40.9  52.3
Table 2: Effects of each component in our design (Y: component enabled). Results are reported on COCO 2017 val.

Effectiveness of Interleaved Branch Execution. In Section 3.1, we designed the interleaved branch execution to benefit the mask branch from updated bounding boxes. To investigate the effectiveness of this strategy, we compare it with the conventional parallel execution pipeline on both Mask R-CNN and Cascade Mask R-CNN. As shown in Table 3, interleaved execution performs better than parallel execution on both methods, with 0.5% and 0.2% mask AP improvements respectively.

Method              execution    box AP  mask AP  AP50  AP75  APS   APM   APL
Mask R-CNN          parallel     38.4    35.1     56.6  37.4  18.7  38.4  47.7
Mask R-CNN          interleaved  38.7    35.6     57.2  37.9  19.0  39.0  48.3
Cascade Mask R-CNN  parallel     42.5    36.5     57.9  39.4  18.9  39.5  50.8
Cascade Mask R-CNN  interleaved  42.5    36.7     57.7  39.4  18.9  39.7  50.8
Table 3: Results of parallel/interleaved branch execution on different methods.

Effectiveness of Mask Information Flow. We study how the introduced mask information flow helps mask prediction by comparing stage-wise performance. The semantic segmentation branch is not added, to exclude possible distraction. From Table 4, we find that introducing the mask information flow greatly improves the mask AP in the second stage. Without direct connections between mask branches, the second stage only benefits from better localized bounding boxes, so the improvement is limited (0.8%, from 35.5 to 36.3). With the mask information flow, the gain is more significant (1.5%, from 35.5 to 37.0), because it makes each stage aware of the preceding stage’s features. Similar to Cascade R-CNN, stage 3 does not outperform stage 2, but it contributes to the ensembled performance.

IF  test stage  AP    AP50  AP75  APS   APM   APL
N   stage 1     35.5  56.7  37.8  18.7  38.8  48.6
N   stage 2     36.3  57.5  39.0  18.8  39.4  50.6
N   stage 3     35.9  56.5  38.7  18.2  39.1  49.9
N   stage 1~3   36.7  57.7  39.4  18.9  39.7  50.8
Y   stage 1     35.5  56.8  37.8  19.0  38.8  49.0
Y   stage 2     37.0  58.0  39.8  19.4  39.8  51.3
Y   stage 3     36.8  57.2  39.9  18.7  39.8  51.1
Y   stage 1~3   37.4  58.1  40.3  19.6  40.3  51.5
Table 4: Effects of the mask information flow (IF). We evaluate stage-wise performance as well as ensembled results (stage 1~3) with or without the information flow.

Effectiveness of Semantic Feature Fusion. We exploit contextual features by introducing a semantic segmentation branch and fusing the features of different branches. Multi-task learning is known to be beneficial; here we study the necessity of semantic feature fusion. We train different models that fuse semantic features with the box branch, the mask branch, or both, and the results are shown in Table 5. It is noted that simply adding a full image segmentation task achieves a 0.6% improvement (37.1 versus 36.5), mainly resulting from additional supervision. Feature fusion also contributes to further gains, e.g., fusing the semantic features with both the box and mask branches brings an extra 0.4% gain, which indicates that complementary information increases feature discrimination for box and mask branches.

Fusion  AP    AP50  AP75  APS   APM   APL
-       36.5  57.9  39.4  18.9  39.5  50.8
none    37.1  58.6  39.9  19.3  40.0  51.7
bbox    37.3  58.9  40.2  19.4  40.2  52.3
mask    37.4  58.7  40.2  19.4  40.1  52.4
both    37.5  59.1  40.4  19.6  40.3  52.6
Table 5: Ablation study of semantic feature fusion on COCO 2017 val. The first row ("-") has no semantic branch; "none" adds the branch without feature fusion.

Influence of Loss Weight. A new hyperparameter is introduced in HTC, since we involve one more task for joint training. We tested different loss weight for the semantic branch, as shown in Table 6. Results show that our method is not sensitive to the loss weight hyperparameter.

Loss weight  AP    AP50  AP75  APS   APM   APL
0.5          37.9  59.3  40.7  19.7  41.0  52.5
1            38.0  59.4  40.7  20.3  40.9  52.3
2            37.9  59.3  40.6  19.6  40.8  52.8
3            37.8  59.0  40.5  19.9  40.5  53.2
Table 6: Ablation study of semantic branch loss weight on COCO 2017 val.

4.5 Extensions on HTC

With the proposed HTC, we achieve 49.0 mask AP and a 2.3% absolute improvement compared to the winning entry last year. Here we list all the tricks and additional modules used to obtain the performance. The step-by-step gains brought by each component are illustrated in Table 7.

HTC Baseline. The ResNet-50 baseline achieves 38.4 mask AP.

DCN. We adopt deformable convolution [13] in the last stage of backbones.


SyncBN. Synchronized Batch Normalization [31, 27] is used in the backbone and heads.

Multi-scale Training. We apply multi-scale training. In each iteration, the scale of the short edge is randomly sampled from [400, 1400], and the scale of the long edge is fixed as 1600.
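Per-iteration scale sampling with aspect-ratio-preserving resizing might look like the sketch below. The short-edge range and long-edge cap are passed as defaults that reflect our reading of the setup and are assumptions; the helper names are hypothetical, and rounding conventions differ between codebases.

```python
import random
from typing import Tuple


def sample_train_scale(rng: random.Random,
                       short_range: Tuple[int, int] = (400, 1400),  # assumed range
                       long_edge: int = 1600) -> Tuple[int, int]:   # assumed cap
    """Draw a (long_edge, short_edge) target for one training iteration."""
    return long_edge, rng.randint(short_range[0], short_range[1])


def rescale_keep_ratio(width: int, height: int,
                       long_edge: int, short_edge: int) -> Tuple[int, int]:
    """Largest size that fits both edge limits without changing aspect ratio."""
    scale = min(long_edge / max(width, height), short_edge / min(width, height))
    return round(width * scale), round(height * scale)


rng = random.Random(0)
long_target, short_target = sample_train_scale(rng)
train_size = rescale_keep_ratio(640, 480, long_target, short_target)
test_size = rescale_keep_ratio(640, 480, long_edge=1333, short_edge=800)
print(train_size, test_size)
```

The same `rescale_keep_ratio` helper covers the single-scale case: a 640 by 480 image with limits 1333 and 800 is bounded by its short edge.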

SENet-154. We tried different backbones besides ResNet-50, and SENet-154 [19] achieves best single model performance among them.

GA-RPN. We finetune trained detectors with the proposals generated by a guided anchoring scheme combined with RPN (GA-RPN) [38], which achieves near 10% higher recall than RPN.

Multi-scale Testing. We use 5 scales as well as horizontal flip at test time and ensemble the results. Testing scales are (600, 900), (800, 1200), (1000, 1500), (1200, 1800), (1400, 2100).

Ensemble. We utilize an ensemble of five networks: SENet-154 [19], ResNeXt-101 [40] 64*4d, ResNeXt-101 32*8d, DPN-107 [9], FishNet [37].

Method            AP    AP50  AP75  APS   APM   APL
2017 winner [27]  46.7  69.5  51.3  26.0  49.1  64.0
Ours              49.0  73.0  53.9  33.9  52.3  61.2
HTC baseline      38.4  60.0  41.5  20.4  40.7  51.2
+ DCN             39.5  61.3  42.8  20.9  41.8  52.7
+ SyncBN          40.7  62.8  44.2  22.2  43.1  54.4
+ ms train        42.5  64.8  46.4  23.7  45.3  56.7
+ SENet-154       44.3  67.5  48.3  25.0  47.5  58.9
+ GA-RPN          45.3  68.9  49.4  27.0  48.3  59.6
+ ms test         47.4  70.6  52.1  30.2  50.1  61.8
+ ensemble        49.0  73.0  53.9  33.9  52.3  61.2
Table 7: Results (mask AP) with better backbones and bells and whistles on COCO test-dev dataset.
Figure 4: Examples of segmentation results on COCO dataset.

4.6 Extensive Study on Common Modules

We also perform an extensive study on some components designed for detection and segmentation. Components are often compared under different conditions such as backbones, codebase, etc. Here we provide a unified environment with a state-of-the-art object detection and instance segmentation framework to investigate the functionality of these components. We integrate several common modules designed for detection and segmentation and evaluate them under the same settings, and the results are shown in Table 8. Limited by our experience and resources, some implementations and the integration may not be optimal and are worth further study. Code will be released as a benchmark to test more components.

  1. ASPP. We adopt the Atrous Spatial Pyramid Pooling (ASPP) [8] module from the semantic segmentation community to capture more image context at multiple scales. We append an ASPP module after FPN.

  2. PAFPN. We test the PAFPN module from PANet [27]. The difference from the original implementation is that we do not use Synchronized BatchNorm.

  3. GCN. We adopt Global Convolutional Network (GCN) [32] in the semantic segmentation branch.

  4. PreciseRoIPooling. We replace the RoI align layers in HTC with Precise RoI Pooling [20].

  5. SoftNMS. We use SoftNMS [3] instead of NMS for box heads.
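Of the modules above, SoftNMS is compact enough to sketch in full. The version below is the Gaussian variant in plain Python, written as an illustration rather than the implementation evaluated here; `sigma` and `score_thr` are conventional illustrative values, not settings taken from the text.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes: Sequence[Box], scores: Sequence[float],
             sigma: float = 0.5, score_thr: float = 0.001) -> List[Tuple[int, float]]:
    """Gaussian Soft-NMS: instead of discarding boxes that overlap a selected
    box, decay their scores by exp(-iou^2 / sigma).
    Returns (original index, rescored score) pairs in selection order."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    keep: List[Tuple[int, float]] = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        if scores[best] < score_thr:
            break
        keep.append((best, scores[best]))
        remaining.remove(best)
        for i in remaining:
            iou = box_iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(iou * iou) / sigma)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = soft_nms(boxes, [0.9, 0.8, 0.7])
print(kept)
```

In this toy input the heavily overlapping second box is not removed; it is demoted below the distant third box, which is the behaviour that distinguishes SoftNMS from hard NMS.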

Method          AP    AP50  AP75  APS   APM   APL
HTC             38.0  59.4  40.7  20.3  40.9  52.3
HTC+ASPP        38.1  59.9  41.0  20.0  41.2  52.8
HTC+PAFPN       38.1  59.5  41.0  20.0  41.2  53.0
HTC+GCN         37.9  59.2  40.7  20.0  40.6  52.3
HTC+PrRoIPool   37.9  59.1  40.9  19.7  40.9  52.7
HTC+SoftNMS     38.3  59.6  41.2  20.4  41.2  52.7
Table 8: Extensive study on related modules on COCO 2017 val.

5 Conclusion

We propose Hybrid Task Cascade (HTC), a new cascade architecture for instance segmentation. It interweaves box and mask branches for a joint multi-stage processing, and adopts a semantic segmentation branch to provide spatial context. This framework progressively refines mask predictions and integrates complementary features together in each stage. Without bells and whistles, the proposed method obtains a 1.5% improvement over a strong Cascade Mask R-CNN baseline on the MSCOCO dataset. Notably, our overall system achieves 48.6 mask AP on the test-challenge dataset and 49.0 mask AP on test-dev.


  • [1] A. Arnab and P. H. Torr. Bottom-up instance segmentation using deep higher-order crfs. arXiv preprint arXiv:1609.02583, 2016.
  • [2] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [3] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-nms—improving object detection with one line of code. In IEEE International Conference on Computer Vision. IEEE, 2017.
  • [4] H. Caesar, J. Uijlings, and V. Ferrari. Coco-stuff: Thing and stuff classes in context. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [5] Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [6] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, C. C. Loy, and D. Lin. mmdetection. https://github.com/open-mmlab/mmdetection, 2018.
  • [7] L.-C. Chen, A. Hermans, G. Papandreou, F. Schroff, P. Wang, and H. Adam. Masklab: Instance segmentation by refining object detection with semantic and direction features. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.
  • [9] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In Advances in Neural Information Processing Systems, 2017.
  • [10] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In European Conference on Computer Vision, 2016.
  • [11] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [12] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, 2016.
  • [13] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In IEEE International Conference on Computer Vision, 2017.
  • [14] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware cnn model. In IEEE International Conference on Computer Vision, 2015.
  • [15] S. Gidaris and N. Komodakis. Attend refine repeat: Active box proposal generation via in-out localization. In British Machine Vision Conference, 2016.
  • [16] R. Girshick. Fast r-cnn. In IEEE International Conference on Computer Vision, 2015.
  • [17] Z. Hayder, X. He, and M. Salzmann. Boundary-aware instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [18] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, 2017.
  • [19] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [20] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang. Acquisition of localization confidence for accurate object detection. In European Conference on Computer Vision, 2018.
  • [21] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [22] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [23] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [24] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In IEEE International Conference on Computer Vision, 2017.
  • [25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
  • [26] S. Liu, J. Jia, S. Fidler, and R. Urtasun. Sgn: Sequential grouping networks for instance segmentation. In IEEE International Conference on Computer Vision, 2017.
  • [27] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [28] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, 2016.
  • [29] W. Ouyang, K. Wang, X. Zhu, and X. Wang. Chained cascade network for object detection. In IEEE International Conference on Computer Vision, 2017.
  • [30] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In Advances in Neural Information Processing Systems Workshop, 2017.
  • [31] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. Megdet: A large mini-batch object detector. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [32] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters—improve semantic segmentation by global convolutional network. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017.
  • [33] P. O. Pinheiro, R. Collobert, and P. Dollár. Learning to segment object candidates. In Advances in Neural Information Processing Systems, 2015.
  • [34] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In European Conference on Computer Vision, 2016.
  • [35] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [36] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
  • [37] S. Sun, J. Pang, J. Shi, S. Yi, and W. Ouyang. Fishnet: A versatile backbone for image, region, and pixel level prediction. In Advances in Neural Information Processing Systems, 2018.
  • [38] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin. Region proposal by guided anchoring. arXiv preprint arXiv:1901.03278, 2019.
  • [39] Z. Wu, C. Shen, and A. v. d. Hengel. Bridging category-level and instance-level semantic image segmentation. arXiv preprint arXiv:1605.06885, 2016.
  • [40] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [41] B. Yang, J. Yan, Z. Lei, and S. Li. Craft objects from images. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [42] Z. Zhang, S. Fidler, and R. Urtasun. Instance-level segmentation for autonomous driving with deep densely connected mrfs. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [43] Z. Zhang, A. G. Schwing, S. Fidler, and R. Urtasun. Monocular object instance segmentation and depth ordering with cnns. In IEEE International Conference on Computer Vision, 2015.