The First Place Solution of Kaggle iMaterialist (Fashion) 2019 at FGVC6
Cascade is a classic yet powerful architecture that has boosted performance on various tasks. However, how to introduce cascade to instance segmentation remains an open question. A simple combination of Cascade R-CNN and Mask R-CNN brings only limited gain. In exploring a more effective approach, we find that the key to a successful instance segmentation cascade is to fully leverage the reciprocal relationship between detection and segmentation. In this work, we propose a new framework, Hybrid Task Cascade (HTC), which differs in two important aspects: (1) instead of performing cascaded refinement on these two tasks separately, it interweaves them for joint multi-stage processing; (2) it adopts a fully convolutional branch to provide spatial context, which can help distinguish hard foreground from cluttered background. Overall, this framework can learn more discriminative features progressively while integrating complementary features together in each stage. Without bells and whistles, a single HTC obtains 38.4 mask AP, improving over a strong Cascade Mask R-CNN baseline on the MSCOCO dataset. More importantly, our overall system achieves 48.6 mask AP on the test-challenge dataset and 49.0 mask AP on test-dev, which are state-of-the-art results.
Instance segmentation is a fundamental computer vision task that performs per-pixel labeling of objects at the instance level. Achieving accurate and robust instance segmentation in real-world scenarios such as autonomous driving and video surveillance is challenging. First, visual objects are often subject to deformation, occlusion and scale changes. Second, background clutter makes object instances hard to isolate. To tackle these issues, we need a robust representation that is resilient to appearance variations. At the same time, it needs to capture rich contextual information for discriminating objects from cluttered background.
Cascade is a classic yet powerful architecture that has boosted performance on various tasks by multi-stage refinement. Cascade R-CNN presented a multi-stage architecture for object detection and achieved promising results. The success of Cascade R-CNN can be ascribed to two key aspects: (1) progressive refinement of predictions and (2) adaptive handling of training distributions.
Although cascade is effective on detection tasks, integrating it into instance segmentation is nontrivial. A direct combination of Cascade R-CNN and Mask R-CNN brings only limited gain in mask AP compared to bbox AP. Specifically, with a ResNet-50-FPN backbone it raises bbox AP from 39.1 to 42.7 but mask AP only from 35.6 to 36.9, as shown in Table 1. An important reason for this large gap is the suboptimal information flow among mask branches of different stages. Mask branches in later stages only benefit from better localized bounding boxes, without direct connections.
To bridge this gap, we propose Hybrid Task Cascade (HTC), a new cascade architecture for instance segmentation. The key idea is to improve the information flow by incorporating cascade and multi-tasking at each stage and leverage spatial context to further boost the accuracy. Specifically, we design a cascaded pipeline for progressive refinement. At each stage, both bounding box regression and mask prediction are combined in a multi-tasking manner. Moreover, direct connections are introduced between the mask branches at different stages – the mask features of each stage will be embedded and fed to the next one, as demonstrated in Figure 2. The overall design strengthens the information flow between tasks and across stages, leading to better refinement at each stage and more accurate predictions on all tasks.
For object detection, the scene context also provides useful clues, e.g. for inferring the categories, scales, etc. To leverage this context, we incorporate a fully convolutional branch that performs pixel-level stuff segmentation. This branch encodes contextual information, not only from foreground instances but also from background regions, thus complementing the bounding boxes and instance masks. Our study shows that the use of the spatial contexts helps to learn more discriminative features.
Hybrid Task Cascade (HTC) is easy to implement and can be trained end-to-end. Without bells and whistles, it achieves higher mask AP than both the Mask R-CNN and Cascade Mask R-CNN baselines on the challenging COCO dataset. Together with better backbones and other common components, e.g. deformable convolution, multi-scale training and testing, and model ensembling, we achieve 49.0 mask AP on the test-dev dataset, which is 2.3% higher than the winning approach of COCO Challenge 2017.
Our main contributions are summarized as follows: (1) We propose Hybrid Task Cascade (HTC), which effectively integrates cascade into instance segmentation by interweaving detection and segmentation features together for joint multi-stage processing. It achieves state-of-the-art performance on COCO test-dev and test-challenge. (2) We demonstrate that spatial contexts benefit instance segmentation by discriminating foreground objects from background clutter. (3) We perform an extensive study on various components and designs, which provides a reference and is helpful for further research on object detection.
Instance segmentation is a task to localize objects of interest in an image at the pixel-level, where segmented objects are generally represented by masks. This task is closely related to both object detection and semantic segmentation. Hence, existing methods for this task roughly fall into two categories, namely detection-based and segmentation-based.
Detection-based methods resort to a conventional detector to generate bounding boxes or region proposals, and then predict the object masks within the bounding boxes. Many of these methods are based on CNNs, including DeepMask, SharpMask, and InstanceFCN. MNC formulates instance segmentation as a pipeline that consists of three sub-tasks: instance localization, mask prediction and object categorization, and trains the whole network end-to-end in a cascaded manner. In a recent work, FCIS extends InstanceFCN and presents a fully convolutional approach for instance segmentation. Mask R-CNN adds an extra branch based on Faster R-CNN to obtain pixel-level mask predictions, which shows that a simple pipeline can yield promising results. PANet adds a bottom-up path besides the top-down path in FPN to facilitate the information flow. MaskLab produces instance-aware masks by combining semantic and direction predictions.
Segmentation-based methods, on the contrary, first obtain a pixel-level segmentation map over the image, and then identify object instances from it. Along this line, Zhang et al. [43, 42] propose to predict instance labels based on local patches and integrate the local results with an MRF. Arnab and Torr also use a CRF to identify instances. Bai and Urtasun propose an alternative way, which combines watershed transform and deep learning to produce an energy map, and then derive the instances by dividing the output of the watershed transform. Other approaches include bridging category-level and instance-level segmentation, learning a boundary-aware mask representation, and employing a sequence of neural networks to deal with different sub-grouping problems.
The past several years have seen remarkable progress in object detection. Mainstream object detection frameworks are often categorized into two types: single-stage, e.g., SSD, YOLO, RetinaNet, and two-stage, e.g., Faster R-CNN, R-FCN, Mask R-CNN.
Recently, detection frameworks with multiple stages have emerged as an increasingly popular paradigm for object detection. Multi-region CNN incorporates an iterative localization mechanism that alternates between box scoring and location refinement. AttractioNet introduces an Attend & Refine module to update bounding box locations iteratively. CRAFT incorporates a cascade structure into RPN and Fast R-CNN to improve the quality of the proposal and detection results. IoU-Net performs progressive bounding box refinement (even though it does not present a cascade structure explicitly). Cascade structures are also used to exclude easy negative samples. For example, CC-Net rejects easy RoIs at shallow layers. Li et al. propose to operate at multiple resolutions to reject simple samples. Among all the works that use cascade structures, Cascade R-CNN is perhaps the most relevant to ours. Cascade R-CNN comprises multiple stages, where the output of each stage is fed into the next one for higher quality refinement. Moreover, the training data of each stage is sampled with increasing IoU thresholds, which inherently handles different training distributions.
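The effect of sampling each stage with an increasing IoU threshold can be illustrated with a small sketch. This is a simplified illustration rather than the Cascade R-CNN implementation: the thresholds 0.5, 0.6 and 0.7 are assumed here as the commonly used setting, the same RoIs are reused at every stage (a real cascade would feed refined boxes forward), and `iou` and `positives_per_stage` are hypothetical helper names.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positives_per_stage(rois, gt_box, thresholds=(0.5, 0.6, 0.7)):
    """For each stage threshold, keep the RoIs that count as positives."""
    return [[r for r in rois if iou(r, gt_box) >= thr] for thr in thresholds]


# Toy data (assumed): one ground-truth box and four RoIs of decreasing overlap.
gt = (0, 0, 10, 10)
rois = [(0, 0, 10, 10), (0, 0, 10, 6.5), (0, 0, 10, 5.5), (0, 0, 10, 4)]
counts = [len(p) for p in positives_per_stage(rois, gt)]
```

Later stages therefore see fewer but better-localized positives, which is the training-distribution shift that the cascade is designed to handle.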
While the proposed framework also adopts a cascade structure, it differs in several important aspects. First, multiple tasks, including detection, mask prediction, and semantic segmentation, are combined at each stage, thus forming a joint multi-stage processing pipeline. In this way, the refinement at each stage benefits from the reciprocal relations among these tasks. Moreover, contextual information is leveraged through an additional branch for stuff segmentation, and a direct path is added to allow direct information flow across stages.
Cascade demonstrated its effectiveness on various tasks such as object detection . However, it is non-trivial to design a successful architecture for instance segmentation. In this work, we find that the key to a successful instance segmentation cascade is to fully leverage the reciprocal relationship between detection and segmentation.
Overview. In this work, we propose Hybrid Task Cascade (HTC), a new framework of instance segmentation. Compared to existing frameworks, it is distinctive in several aspects: (1) It interleaves bounding box regression and mask prediction instead of executing them in parallel. (2) It incorporates a direct path to reinforce the information flow between mask branches by feeding the mask features of the preceding stage to the current one. (3) It aims to explore more contextual information by adding an additional semantic segmentation branch and fusing it with box and mask branches. Overall, these changes to the framework architecture effectively improve the information flow, not only across stages but also between tasks.
We begin with a direct combination of Mask R-CNN and Cascade R-CNN, denoted as Cascade Mask R-CNN. Specifically, a mask branch following the architecture of Mask R-CNN is added to each stage of Cascade R-CNN, as shown in Figure 1(a). The pipeline is formulated as:
$$
\begin{aligned}
\mathbf{x}_t^{box} &= \mathcal{P}(\mathbf{x}, \mathbf{r}_{t-1}), & \mathbf{r}_t &= B_t(\mathbf{x}_t^{box}), \\
\mathbf{x}_t^{mask} &= \mathcal{P}(\mathbf{x}, \mathbf{r}_{t-1}), & \mathbf{m}_t &= M_t(\mathbf{x}_t^{mask}).
\end{aligned}
$$
Here, $\mathbf{x}$ indicates the CNN features of the backbone network, and $\mathbf{x}_t^{box}$ and $\mathbf{x}_t^{mask}$ indicate the box and mask features derived from $\mathbf{x}$ and the input RoIs. $\mathcal{P}(\cdot)$ is a pooling operator, e.g., RoI Align or RoI pooling, $B_t$ and $M_t$ denote the box and mask head at the $t$-th stage, and $\mathbf{r}_t$ and $\mathbf{m}_t$ represent the corresponding box predictions and mask predictions. By combining the advantages of cascaded refinement and the mutual benefits between bounding box and mask predictions, this design improves the box AP compared to Mask R-CNN and Cascade R-CNN alone. However, the mask prediction performance remains unsatisfying.
One drawback of the above design is that the two branches at each stage are executed in parallel, both taking the bounding box predictions from the preceding stage as input. Consequently, the two branches do not directly interact within a stage. In response to this issue, we explore an improved design, which interleaves the box and mask branches, as illustrated in Figure 1(b). The interleaved execution can be expressed as:
$$
\begin{aligned}
\mathbf{x}_t^{box} &= \mathcal{P}(\mathbf{x}, \mathbf{r}_{t-1}), & \mathbf{r}_t &= B_t(\mathbf{x}_t^{box}), \\
\mathbf{x}_t^{mask} &= \mathcal{P}(\mathbf{x}, \mathbf{r}_t), & \mathbf{m}_t &= M_t(\mathbf{x}_t^{mask}).
\end{aligned}
$$
In this way, the mask branch can take advantage of the updated bounding box predictions. We found that this yields improved performance.
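The difference between parallel and interleaved execution comes down to which box the mask head pools its features from. The following toy sketch makes that control flow explicit; it is not the actual network, and `refine_box` is a hypothetical stand-in for a learned box head that simply moves a box halfway toward a target.

```python
def refine_box(box, target, step=0.5):
    """Toy box head: move each coordinate part of the way to the target."""
    return tuple(b + step * (t - b) for b, t in zip(box, target))


def run_cascade(proposal, target, num_stages=3, interleaved=True):
    """Return, per stage, the box that the mask head pools its features from."""
    box, mask_inputs = proposal, []
    for _ in range(num_stages):
        refined = refine_box(box, target)  # box head output of this stage
        # Parallel: the mask head reuses the previous stage's box.
        # Interleaved: the mask head uses the box refined in this stage.
        mask_inputs.append(refined if interleaved else box)
        box = refined
    return mask_inputs


proposal, target = (0.0, 0.0, 8.0, 8.0), (2.0, 2.0, 10.0, 10.0)
parallel = run_cascade(proposal, target, interleaved=False)
interleaved = run_cascade(proposal, target, interleaved=True)
```

In this toy setting the interleaved mask head is always one refinement step ahead of the parallel one, so the final stage segments inside a box that the parallel design never uses for masks.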
In the design above, the mask prediction at each stage is based purely on the RoI features $\mathbf{x}$ and the box prediction $\mathbf{r}_t$. There is no direct information flow between mask branches at different stages, which prevents further improvements on mask prediction accuracy. Towards a good design of mask information flow, we first recall the design of the cascaded box branches in Cascade R-CNN. An important point is that the input feature of the box branch is jointly determined by the output of the preceding stage and the backbone. Following similar principles, we introduce an information flow between mask branches by feeding the mask features of the preceding stage to the current stage, as illustrated in Figure 1(c). With the direct path between mask branches, the pipeline can be written as:
$$
\begin{aligned}
\mathbf{x}_t^{box} &= \mathcal{P}(\mathbf{x}, \mathbf{r}_{t-1}), & \mathbf{r}_t &= B_t(\mathbf{x}_t^{box}), \\
\mathbf{x}_t^{mask} &= \mathcal{P}(\mathbf{x}, \mathbf{r}_t), & \mathbf{m}_t &= M_t(\mathcal{F}(\mathbf{x}_t^{mask}, \mathbf{m}_{t-1}^{-})),
\end{aligned}
$$
where $\mathbf{m}_{t-1}^{-}$ denotes the intermediate feature of $M_{t-1}$, which we use as the mask representation of stage $t-1$, and $\mathcal{F}$ is a function that combines the features of the current stage and the preceding one. This information flow makes progressive refinement of masks possible, instead of predicting masks on progressively refined bounding boxes.
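Following the discussion above, we propose a simple implementation as below.
$$
\mathcal{F}(\mathbf{x}_t^{mask}, \mathbf{m}_{t-1}^{-}) = \mathbf{x}_t^{mask} + \mathcal{G}_t(\mathbf{m}_{t-1}^{-})
$$
In this implementation, we adopt the RoI feature before the deconvolutional layer as the mask representation $\mathbf{m}_{t-1}^{-}$, whose spatial size is $14 \times 14$. At stage $t$, we need to forward all preceding mask heads on the RoI features $\mathbf{x}_t^{mask}$ of the current stage to compute $\mathbf{m}_{t-1}^{-}$:
$$
\begin{aligned}
\mathbf{m}_1^{-} &= M_1^{-}(\mathbf{x}_t^{mask}), \\
\mathbf{m}_2^{-} &= M_2^{-}(\mathcal{F}(\mathbf{x}_t^{mask}, \mathbf{m}_1^{-})), \\
&\;\;\vdots \\
\mathbf{m}_{t-1}^{-} &= M_{t-1}^{-}(\mathcal{F}(\mathbf{x}_t^{mask}, \mathbf{m}_{t-2}^{-})).
\end{aligned}
$$
Here, $M_t^{-}$ denotes the feature transformation component of the mask head $M_t$, which is comprised of 4 consecutive $3 \times 3$ convolutional layers, as shown in Figure 2. The transformed features $\mathbf{m}_{t-1}^{-}$ are then embedded with a $1 \times 1$ convolutional layer $\mathcal{G}_t$ in order to be aligned with the pooled backbone features $\mathbf{x}_t^{mask}$. Finally, $\mathcal{G}_t(\mathbf{m}_{t-1}^{-})$ is added to $\mathbf{x}_t^{mask}$ through element-wise sum. With this introduced bridge, adjacent mask branches are brought into direct interaction. Mask features in different stages are no longer isolated and all get supervised through backpropagation.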
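The recursion over preceding mask heads can be sketched in a few lines. This is a minimal illustration under strong simplifications: features are flat lists of floats, and `make_scale` is a hypothetical stand-in for learned layers (both the feature transformation of each mask head and the embedding layer are replaced by scalar multiplications).

```python
def add(a, b):
    """Element-wise sum of two equally sized feature vectors."""
    return [u + v for u, v in zip(a, b)]


def make_scale(w):
    """Stand-in for a learned layer: multiplies every element by w."""
    return lambda feat: [w * v for v in feat]


def mask_feature_flow(x_mask, transforms, embeds):
    """Fuse the current-stage RoI feature with the preceding mask features.

    transforms[i] plays the role of the feature transformation of mask head
    i + 1; embeds[i] embeds its output before the element-wise sum.
    With t - 1 entries, the result is the input to the mask head of stage t.
    """
    fused = x_mask  # stage 1 has no preceding mask feature
    for transform, embed in zip(transforms, embeds):
        m_prev = transform(fused)           # intermediate mask feature
        fused = add(x_mask, embed(m_prev))  # current feature + embedded flow
    return fused


x = [1.0, 2.0]
stage3_input = mask_feature_flow(
    x,
    transforms=[make_scale(2.0), make_scale(3.0)],
    embeds=[make_scale(0.5), make_scale(1.0)],
)
```

Note that every preceding transformation is re-run on the current stage's RoI feature, which is why the cost of the mask path grows with the stage index.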
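To further help distinguish the foreground from the cluttered background, we use the spatial contexts as an effective cue. We add an additional branch to predict per-pixel semantic segmentation for the whole image, which adopts the fully convolutional architecture and is jointly trained with other branches, as shown in Figure 1(d). The semantic segmentation feature is a strong complement to existing box and mask features, thus we combine them together for better predictions:
$$
\begin{aligned}
\mathbf{x}_t^{box} &= \mathcal{P}(\mathbf{x}, \mathbf{r}_{t-1}) + \mathcal{P}(S(\mathbf{x}), \mathbf{r}_{t-1}), & \mathbf{r}_t &= B_t(\mathbf{x}_t^{box}), \\
\mathbf{x}_t^{mask} &= \mathcal{P}(\mathbf{x}, \mathbf{r}_t) + \mathcal{P}(S(\mathbf{x}), \mathbf{r}_t), & \mathbf{m}_t &= M_t(\mathcal{F}(\mathbf{x}_t^{mask}, \mathbf{m}_{t-1}^{-})),
\end{aligned}
$$
where $S$ indicates the semantic segmentation head. In the above formulation, the box and mask heads of each stage not only take the RoI features extracted from the backbone as input, but also exploit semantic features, which can be more discriminative on cluttered background.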
Specifically, the semantic segmentation branch is constructed based on the output of the Feature Pyramid. Note that for semantic segmentation, the features at a single level may not be able to provide enough discriminative power. Hence, our design incorporates the features at multiple levels. In addition to the mid-level features, we also incorporate higher-level features with global information and lower-level features with local information for better feature representation.
Figure 3 shows the architecture of this branch. Each level of the feature pyramid is first aligned to a common representation space via a $1 \times 1$ convolutional layer. Then low-level feature maps are upsampled, and high-level feature maps are downsampled, to the same spatial scale, where the stride is set to 8. We found empirically that this setting is sufficient for fine pixel-level predictions on the whole image. These transformed feature maps from different levels are subsequently fused by element-wise sum. Moreover, we add four convolutional layers thereon to further bridge the semantic gap. At the end, we simply adopt a convolutional layer to predict the pixel-wise segmentation map. Overall, we try to keep the design of the semantic segmentation branch simple and straightforward. Though a more delicate structure could further improve the performance, it goes beyond our scope and we leave it for future work.
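The multi-level fusion step can be sketched as follows. This is an illustrative simplification, not the actual branch: feature maps are single-channel 2D lists, nearest-neighbour resizing is assumed for both upsampling and downsampling, and the lateral and subsequent convolutional layers are omitted. The names `resize_nearest` and `fuse_levels` are hypothetical.

```python
def resize_nearest(fmap, out_h, out_w):
    """Resize a 2D map to (out_h, out_w) by nearest-neighbour sampling."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]


def fuse_levels(levels, target_index):
    """Bring every pyramid level to the target level's size and sum them."""
    out_h, out_w = len(levels[target_index]), len(levels[target_index][0])
    fused = [[0.0] * out_w for _ in range(out_h)]
    for fmap in levels:
        resized = resize_nearest(fmap, out_h, out_w)
        for i in range(out_h):
            for j in range(out_w):
                fused[i][j] += resized[i][j]
    return fused


# Toy pyramid (assumed): three levels of a 16x16 image at strides 4, 8 and 16.
levels = [
    [[1.0] * 4 for _ in range(4)],  # stride 4, downsampled during fusion
    [[1.0, 2.0], [3.0, 4.0]],       # stride 8, the common fusion scale
    [[10.0]],                       # stride 16, upsampled during fusion
]
fused = fuse_levels(levels, target_index=1)
```

The fused map has the resolution of the stride-8 level, so each output location mixes fine local detail with coarser, more global context.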
Fusing Context Features into the Main Framework. It is well known that joint training of closely related tasks can improve feature representation and bring performance gains to the original tasks. Here, we propose to fuse the semantic features with box/mask features to allow more interaction between different branches. In this way, the semantic branch directly contributes to the prediction of bounding boxes and masks with the encoded spatial contexts. Following the standard practice, given a RoI, we use RoIAlign to extract a small (e.g., $7 \times 7$ or $14 \times 14$) feature patch from the corresponding level of the feature pyramid outputs as the representation. At the same time, we also apply RoIAlign on the feature map of the semantic branch to obtain a feature patch of the same shape, and then combine the features from both branches by element-wise sum.
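A rough sketch of this fusion is given below. It assumes single-channel feature maps and replaces RoIAlign with a much simpler nearest-neighbour sampling at bin centres, so it only illustrates the idea of pooling the same RoI from two maps with different strides and summing the patches. `roi_sample` and `fuse_roi_features` are hypothetical names.

```python
def roi_sample(fmap, box, stride, out_size):
    """Sample an out_size x out_size patch at bin centres (nearest neighbour).

    A simplified stand-in for RoIAlign, which would interpolate bilinearly.
    box is (x1, y1, x2, y2) in image coordinates; stride maps it to fmap.
    """
    x1, y1, x2, y2 = (c / stride for c in box)
    h, w = len(fmap), len(fmap[0])
    patch = []
    for i in range(out_size):
        cy = y1 + (i + 0.5) * (y2 - y1) / out_size
        row = []
        for j in range(out_size):
            cx = x1 + (j + 0.5) * (x2 - x1) / out_size
            yi = min(max(int(cy), 0), h - 1)  # clamp to the map
            xi = min(max(int(cx), 0), w - 1)
            row.append(fmap[yi][xi])
        patch.append(row)
    return patch


def fuse_roi_features(backbone_map, backbone_stride, semantic_map,
                      semantic_stride, box, out_size):
    """Pool the RoI from both maps and combine them by element-wise sum."""
    a = roi_sample(backbone_map, box, backbone_stride, out_size)
    b = roi_sample(semantic_map, box, semantic_stride, out_size)
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy maps (assumed) for a 32x32 image: backbone level at stride 16,
# semantic feature map at stride 8.
backbone = [[1.0, 2.0], [3.0, 4.0]]
semantic = [[10.0] * 4 for _ in range(4)]
patch = fuse_roi_features(backbone, 16, semantic, 8, (0, 0, 32, 32), 2)
```

Because both patches have the same shape regardless of the source stride, the sum is well defined for any pyramid level the RoI is assigned to.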
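Since all the modules described above are differentiable, Hybrid Task Cascade (HTC) can be trained in an end-to-end manner. At each stage $t$, the box head predicts the classification score $c_t$ and regression offset $\mathbf{r}_t$ for all sampled RoIs. The mask head predicts pixel-wise masks $\mathbf{m}_t$ for positive RoIs. The semantic branch predicts a full image semantic segmentation map $\mathbf{s}$. The overall loss function takes the form of multi-task learning:
$$
\begin{aligned}
L &= \sum_{t=1}^{T} \alpha_t \left( L_{bbox}^t + L_{mask}^t \right) + \beta L_{seg}, \\
L_{bbox}^t(c_t, \mathbf{r}_t, \hat{c}_t, \hat{\mathbf{r}}_t) &= L_{cls}(c_t, \hat{c}_t) + L_{reg}(\mathbf{r}_t, \hat{\mathbf{r}}_t), \\
L_{mask}^t(\mathbf{m}_t, \hat{\mathbf{m}}_t) &= \mathrm{BCE}(\mathbf{m}_t, \hat{\mathbf{m}}_t), \\
L_{seg} &= \mathrm{CE}(\mathbf{s}, \hat{\mathbf{s}}).
\end{aligned}
$$
Here, $L_{bbox}^t$ is the loss of the bounding box predictions at stage $t$, which follows the same definition as in Cascade R-CNN and combines two terms $L_{cls}$ and $L_{reg}$, respectively for classification and bounding box regression; hatted symbols denote the corresponding targets. $L_{mask}^t$ is the loss of mask prediction at stage $t$, which adopts the binary cross entropy form as in Mask R-CNN. $L_{seg}$ is the semantic segmentation loss in the form of cross entropy. The coefficients $\alpha_t$ and $\beta$ are used to balance the contributions of different stages and tasks. We follow the hyperparameter settings in Cascade R-CNN. Unless otherwise noted, we set $\alpha = [1, 0.5, 0.25]$, $T = 3$ and $\beta = 1$ by default.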
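The weighting of stage and task losses can be written out as a short sketch. The per-stage loss values are assumed to be already computed scalars, and the default weights (stage weights 1, 0.5, 0.25 and a semantic weight of 1) are assumed here following the Cascade R-CNN convention; `htc_loss` is a hypothetical helper name.

```python
def htc_loss(bbox_losses, mask_losses, seg_loss,
             alphas=(1.0, 0.5, 0.25), beta=1.0):
    """Weighted sum of per-stage box and mask losses plus the semantic loss."""
    if not (len(bbox_losses) == len(mask_losses) == len(alphas)):
        raise ValueError("one box loss, mask loss and weight per stage")
    staged = sum(a * (lb + lm)
                 for a, lb, lm in zip(alphas, bbox_losses, mask_losses))
    return staged + beta * seg_loss


# Toy values (assumed): every per-stage loss equals 1.0, semantic loss 0.5.
total = htc_loss([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], seg_loss=0.5)
```

With these weights the first stage dominates the objective, while later stages act as progressively lighter refinements.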
|Method||Backbone||box AP||mask AP||AP50||AP75||APS||APM||APL||runtime (fps)|
|Mask R-CNN ||ResNet-50-FPN||39.1||35.6||57.6||38.1||18.7||38.3||46.6||5.3|
|Cascade Mask R-CNN||ResNet-50-FPN||42.7||36.9||58.6||39.7||19.6||39.3||48.8||3.0|
|Cascade Mask R-CNN||ResNet-101-FPN||44.4||38.4||60.2||41.4||20.2||41.0||50.6||2.9|
|Cascade Mask R-CNN||ResNeXt-101-FPN||46.6||40.1||62.7||43.4||22.0||42.8||52.9||2.5|
Datasets. We perform experiments on the challenging COCO instance segmentation dataset. We train our models on the 2017train split (115k images) and report results on 2017val and 2017test-dev. Typical instance annotations are used to supervise box and mask branches, and the semantic branch is supervised by COCO-stuff annotations.
Evaluation Metrics. We report the standard COCO-style Average Precision (AP) metrics, which average APs across IoU thresholds from 0.5 to 0.95 with an interval of 0.05. Both box AP and mask AP are evaluated. For mask AP, we also report AP50, AP75 (AP at different IoU thresholds) and APS, APM, APL (AP at different scales). Runtime is measured on a single TITAN Xp GPU.
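To make the metric concrete, the sketch below computes the IoU between two binary masks and averages per-threshold AP values over the ten COCO IoU thresholds (0.5 to 0.95 in steps of 0.05). It does not compute AP itself, which requires ranking detections and integrating the precision-recall curve; the per-threshold AP values are assumed to be given, and `mask_iou` and `coco_style_ap` are hypothetical names.

```python
def mask_iou(a, b):
    """IoU of two binary masks given as equally sized 2D lists of 0/1."""
    pairs = [(p, q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    inter = sum(1 for p, q in pairs if p and q)
    union = sum(1 for p, q in pairs if p or q)
    return inter / union if union else 0.0


# The ten IoU thresholds used by COCO-style AP.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def coco_style_ap(ap_at_threshold):
    """Average AP values (a dict keyed by IoU threshold) over all thresholds."""
    return sum(ap_at_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


pred = [[1, 1], [0, 0]]
gt_mask = [[1, 0], [1, 0]]
overlap = mask_iou(pred, gt_mask)
```

A prediction with this overlap would count as a miss at every COCO threshold, which is why mask AP is sensitive to boundary quality and not only to localization.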
In all experiments, we adopt a 3-stage cascade, and FPN is used in all backbones. For fair comparison, Mask R-CNN and Cascade R-CNN are reimplemented with PyTorch and mmdetection, and our reimplementations perform slightly better than reported in the original papers. We train detectors with 16 GPUs (one image per GPU) for 20 epochs with an initial learning rate of 0.02, and decrease it by a factor of 0.1 after 16 and 19 epochs, respectively. The long edge and short edge of images are resized to 1333 and 800 respectively without changing the aspect ratio.
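The step schedule and aspect-preserving resize can be sketched as below. All numeric defaults (base rate 0.02, drops by 0.1 after epochs 16 and 19, and a 1333/800 size limit) are assumptions chosen to match a common mmdetection-style setting, and both function names are hypothetical; the resize rule shown (scale as large as possible while keeping the long edge and short edge within their limits) is likewise an assumed interpretation.

```python
def learning_rate(epoch, base_lr=0.02, milestones=(16, 19), gamma=0.1):
    """Step schedule: multiply the rate by gamma once each milestone is reached.

    epoch is zero-based, so epoch 16 is the first epoch after 16 completed ones.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


def rescale_size(width, height, max_long=1333, max_short=800):
    """Largest size with the same aspect ratio that fits both edge limits."""
    scale = min(max_long / max(width, height), max_short / min(width, height))
    return round(width * scale), round(height * scale)


lr_start = learning_rate(0)
new_size = rescale_size(640, 480)
```

For a 640 x 480 image the short-edge limit is the binding one, so the short edge becomes 800 and the long edge stays below 1333.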
During inference, object proposals are refined progressively by box heads of different stages. Classification scores of multiple stages are ensembled as in Cascade R-CNN. Mask branches are only applied to detection boxes with higher scores than a threshold and are also ensembled.
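The two inference steps described here can be sketched as follows, assuming that ensembling means averaging the per-class scores of all stages (the averaging rule, the threshold value and the function names are assumptions for illustration, not taken from the released code).

```python
def ensemble_scores(stage_scores):
    """Average per-class classification scores over all cascade stages."""
    num_stages = len(stage_scores)
    return [sum(col) / num_stages for col in zip(*stage_scores)]


def boxes_for_mask_head(boxes, scores, score_thr):
    """Keep only detections confident enough to be passed to the mask heads."""
    return [b for b, s in zip(boxes, scores) if s > score_thr]


# Toy scores (assumed) for two classes from three stages.
avg = ensemble_scores([[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]])
kept = boxes_for_mask_head(["box_a", "box_b"], avg, score_thr=0.5)
```

Filtering before the mask heads keeps inference cost down, since mask prediction is only run on boxes that are likely to survive in the final output.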
We compare Hybrid Task Cascade with the state-of-the-art instance segmentation approaches on the COCO dataset in Table 1. We also evaluate Cascade Mask R-CNN, described in Section 1, as a strong baseline of our method. Compared to Mask R-CNN, the naive cascaded baseline raises box AP from 39.1 to 42.7 and mask AP from 35.6 to 36.9 with ResNet-50-FPN. Note that this baseline is already higher than PANet, the state-of-the-art instance segmentation method. Our HTC achieves consistent improvements over this baseline with the ResNet-50-FPN, ResNet-101-FPN and ResNeXt-101-FPN backbones, demonstrating its effectiveness.
Component-wise Analysis. First, we investigate the effects of the main components in our framework. “Interleaved” denotes the interleaved execution of bbox and mask branches, “Mask Info” indicates the introduction of mask branch information flow and “Semantic” means adding the semantic segmentation branch. From Table 2, we can see that the interleaved execution slightly improves the mask AP, the mask information flow contributes a further improvement, and the semantic segmentation branch leads to an additional gain.
|Cascade||Interleaved||Mask Info||Semantic||box AP||mask AP|
Effectiveness of Interleaved Branch Execution. In Section 3.1, we designed the interleaved branch execution so that the mask branch benefits from updated bounding boxes. To investigate the effectiveness of this strategy, we compare it with the conventional parallel execution pipeline on both Mask R-CNN and Cascade Mask R-CNN. As shown in Table 3, interleaved execution performs better than parallel execution on both methods.
|Method||execution||box AP||mask AP||AP50||AP75||APS||APM||APL|
|Cascade Mask R-CNN||parallel||42.5||36.5||57.9||39.4||18.9||39.5||50.8|
Effectiveness of Mask Information Flow. We study how the introduced mask information flow helps mask prediction by comparing stage-wise performance. The semantic segmentation branch is not added, to exclude possible distraction. From Table 4, we find that introducing the mask information flow greatly improves the mask AP in the second stage. Without direct connections between mask branches, the second stage only benefits from better localized bounding boxes, so the improvement is limited. With the mask information flow, the gain is more significant, because it makes each stage aware of the preceding stage's features. Similar to Cascade R-CNN, stage 3 does not outperform stage 2, but it contributes to the ensembled performance.
Effectiveness of Semantic Feature Fusion. We exploit contextual features by introducing a semantic segmentation branch and fusing the features of different branches. Multi-task learning is known to be beneficial; here we study the necessity of semantic feature fusion. We train different models that fuse semantic features with the box branch, the mask branch, or both, and the results are shown in Table 5. Simply adding a full image segmentation task already brings an improvement, mainly resulting from additional supervision. Feature fusion contributes further gains, e.g., fusing the semantic features with both the box and mask branches brings an extra gain, which indicates that complementary information increases feature discrimination for box and mask branches.
Influence of Loss Weight. A new hyperparameter is introduced in HTC, since we involve one more task for joint training. We tested different loss weights for the semantic branch, as shown in Table 6. Results show that our method is not sensitive to this loss weight.
With the proposed HTC, we achieve 49.0 mask AP on test-dev, a 2.3% absolute improvement compared to the winning entry last year. Here we list all the tricks and additional modules used to obtain the performance. The step-by-step gains brought by each component are illustrated in Table 7.
HTC Baseline. We start from the ResNet-50 HTC baseline.
DCN. We adopt deformable convolution  in the last stage of backbones.
Multi-scale Training. We apply multi-scale training. In each iteration, the scale of the short edge is randomly sampled from a fixed range, and the scale of the long edge is kept fixed.
SENet-154. We tried different backbones besides ResNet-50, and SENet-154  achieves best single model performance among them.
GA-RPN. We finetune trained detectors with the proposals generated by a guided anchoring scheme combined with RPN (GA-RPN) , which achieves near 10% higher recall than RPN.
Multi-scale Testing. We use 5 scales as well as horizontal flip at test time and ensemble the results. Testing scales are (600, 900), (800, 1200), (1000, 1500), (1200, 1800), (1400, 2100).
|Method||mask AP||AP50||AP75||APS||APM||APL|
|2017 winner||46.7||69.5||51.3||26.0||49.1||64.0|
|+ ms train||42.5||64.8||46.4||23.7||45.3||56.7|
|+ ms test||47.4||70.6||52.1||30.2||50.1||61.8|
We also perform an extensive study on some components designed for detection and segmentation. Components are often compared under different conditions such as backbones, codebases, etc. Here we provide a unified environment with a state-of-the-art object detection and instance segmentation framework to investigate the functionality of a wide range of components. We integrate several common modules designed for detection and segmentation and evaluate them under the same settings, and the results are shown in Table 8. Limited by our experience and resources, some implementations and their integration may not be optimal and are worth further study. Code will be released as a benchmark to test more components.
ASPP. We adopt Atrous Spatial Pyramid Pooling (ASPP)  module from the semantic segmentation community to capture more image context at multiple scales. We append an ASPP module after FPN.
PAFPN. We test the PAFPN module from PANet . The difference from the original implementation is that we do not use Synchronized BatchNorm.
GCN. We adopt Global Convolutional Network (GCN)  in the semantic segmentation branch.
PreciseRoIPooling. We replace the RoI align layers in HTC with Precise RoI Pooling .
SoftNMS. We use SoftNMS instead of NMS for box heads.
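For readers unfamiliar with the last component, a minimal sketch of Soft-NMS is given below. It uses the Gaussian decay variant, in which overlapping boxes have their scores reduced rather than being discarded outright; the values of `sigma` and `score_thr` are assumed for illustration, and this is not the implementation used in the experiments.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """Gaussian Soft-NMS: decay the scores of boxes overlapping a kept box."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        box, score = remaining.pop(best)
        kept.append((box, score))
        decayed = []
        for other_box, other_score in remaining:
            weight = math.exp(-box_iou(box, other_box) ** 2 / sigma)
            if other_score * weight > score_thr:
                decayed.append((other_box, other_score * weight))
        remaining = decayed
    return kept


boxes = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30)]
result = soft_nms(boxes, [0.9, 0.8, 0.7])
```

In this toy case the duplicate box survives with a heavily reduced score instead of being removed, whereas standard NMS with a typical IoU threshold would drop it.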
We propose Hybrid Task Cascade (HTC), a new cascade architecture for instance segmentation. It interweaves box and mask branches for joint multi-stage processing, and adopts a semantic segmentation branch to provide spatial context. This framework progressively refines mask predictions and integrates complementary features together in each stage. Without bells and whistles, the proposed method obtains an improvement over a strong Cascade Mask R-CNN baseline on the MSCOCO dataset. Notably, our overall system achieves 48.6 mask AP on the test-challenge dataset and 49.0 mask AP on test-dev.