Instance segmentation is a fundamental yet challenging task in computer vision, which requires an algorithm to predict a per-pixel mask with a category label for each instance of interest in an image. Panoptic segmentation further requires the algorithm to segment the stuff (e.g., sky and grass), assigning every pixel in the image a semantic label. Panoptic segmentation is often built on an instance segmentation framework with an extra semantic segmentation branch. Therefore, instance and panoptic segmentation share the same key challenge: how to efficiently and effectively distinguish individual instances.
Despite a few works being proposed recently, the dominant method tackling this challenge is still the two-stage method as in Mask R-CNN, which casts instance segmentation into a two-stage detection-and-segmentation task. To be specific, Mask R-CNN first employs an object detector, Faster R-CNN, to predict a bounding-box for each instance. Then for each instance, regions-of-interest (ROIs) are cropped from the network's feature maps using the ROIAlign operation. To predict the final masks for each instance, a compact fully convolutional network (FCN) (i.e., mask head) is applied to these ROIs to perform foreground/background segmentation. However, this ROI-based method may have the following drawbacks. 1) Since ROIs are often axis-aligned bounding-boxes, for objects with irregular shapes, they may contain an excessive amount of irrelevant image content including background and other instances. This issue may be mitigated by using rotated ROIs, but at the price of a more complex pipeline. 2) In order to distinguish between the foreground instance and the background stuff or instance(s), the mask head requires a relatively large receptive field to encode sufficient context information. As a result, a stack of convolutions is needed in the mask head (e.g., four convolutions with 256 channels in Mask R-CNN). This considerably increases the computational complexity of the mask head, so that the inference time varies significantly with the number of instances. 3) ROIs are typically of different sizes. In order to use effective batched computation in modern deep learning frameworks [1, 33], a resizing operation is often required to resize the cropped regions into patches of the same size. For instance, Mask R-CNN resizes all the cropped regions to $14 \times 14$ (upsampled to $28 \times 28$ using a deconvolution), which restricts the output resolution of instance segmentation, as large instances would require higher resolutions to retain details at the boundary.
FCNs have also shown excellent performance on many other per-pixel prediction tasks, ranging from low-level image processing such as denoising and super-resolution, to mid-level tasks such as optical flow estimation and contour detection, and high-level tasks including recent single-shot object detection, monocular depth estimation [29, 53, 54, 2] and counting. However, almost all the instance segmentation methods based on FCNs (by FCNs, we mean the vanilla FCNs that only involve convolutions and pooling) lag behind state-of-the-art ROI-based methods. Why do the versatile FCNs perform unsatisfactorily on instance segmentation? This is because FCNs tend to yield similar predictions for similar image appearance. As a result, the vanilla FCNs are incapable of distinguishing individual instances. For example, if two persons A and B with similar appearance are in an input image, when predicting the instance mask of A, the FCN needs to predict B as background w.r.t. A, which can be difficult as they look similar in appearance. Therefore, an ROI operation is used to crop the person of interest, i.e., A, and filter out B. Essentially, this is the core operation making the model attend to an instance. Instead of using ROIs, CondInst attends to each instance by using instance-sensitive convolution filters as well as relative coordinates that are appended to the feature maps.
Specifically, unlike Mask R-CNN, which uses a standard convolutional network with a fixed set of convolutional filters as the mask head for predicting all instances, CondInst adapts the network parameters according to the instance to be predicted. Inspired by dynamic filtering networks and CondConv, for each instance, a controller sub-network (see Fig. 3) dynamically generates the parameters of the mask FCN (conditioned on the center area of the instance), which are then used to predict the mask of this instance. It is expected that the network parameters can encode the characteristics (e.g., relative position, shape and appearance) of this instance, so that the mask head only fires on the pixels of this instance, which thus bypasses the difficulty mentioned above. These conditional mask heads are applied to the whole feature maps, eliminating the need for ROI operations. At first glance, the idea may not work well, as instance-wise mask heads may incur a large number of network parameters given that some images contain as many as dozens of instances. However, as the mask head filters are only asked to predict the mask of one instance, the learning requirement is largely eased and the load on the filters is reduced. As a result, the mask head can be extremely light-weight. We will show that a very compact mask head with dynamically-generated filters can already outperform the previous ROI-based Mask R-CNN. This compact mask head also results in much lower computational complexity per instance than that of the mask head in Mask R-CNN.
We summarize our main contributions as follows.
We attempt to solve instance segmentation from a new perspective that uses dynamic mask heads. This novel solution achieves better instance segmentation performance than existing methods such as Mask R-CNN. To our knowledge, this is the first time that a new instance segmentation framework outperforms the recent state of the art in both accuracy and speed.
CondInst is fully convolutional and avoids the aforementioned resizing operation used in many existing methods, as CondInst does not rely on ROI operations. Not having to resize feature maps leads to high-resolution instance masks with more accurate edges, as shown in Fig. 2.
Since the mask head in CondInst is very compact, compared with the box detector FCOS, CondInst needs only 10% more computational time to obtain the mask results, even when processing the maximum number of instances per image (i.e., 100 instances). The overall inference time is also stable as it does not depend on the number of instances in the image.
With an extra semantic segmentation branch, CondInst can be easily extended to panoptic segmentation, resulting in a unified fully convolutional network for both instance and panoptic segmentation tasks.
CondInst achieves state-of-the-art performance on both instance and panoptic segmentation tasks while being fast and simple. We hope that CondInst can be a new strong alternative for instance and panoptic segmentation tasks, as well as other instance-level recognition tasks such as keypoint detection.
1.1 Related Work
Here we review some work that is most relevant to ours.
Conditional Convolutions. Unlike traditional convolutional layers, which have fixed filters once trained, the filters of conditional convolutions are conditioned on the input and are dynamically generated by another network (i.e., a controller). This idea has been explored previously in dynamic filter networks  and CondConv  mainly for the purpose of increasing the capacity of a classification network. In this work, we extend this idea to solve the significantly more challenging task of instance segmentation.
Instance Segmentation. To date, the dominant framework for instance segmentation is still Mask R-CNN. Mask R-CNN first employs an object detector to detect the bounding-boxes of instances (e.g., ROIs). With these bounding-boxes, an ROI operation is used to crop the features of the instance from the feature maps. Finally, a compact FCN head is used to obtain the desired instance masks. Many works [7, 30, 20] with top performance are built on Mask R-CNN. Moreover, some works have explored applying FCNs to instance segmentation. InstanceFCN may be the first instance segmentation method that is fully convolutional. InstanceFCN proposes to predict position-sensitive score maps with vanilla FCNs. Afterwards, these score maps are assembled to obtain the desired instance masks. Note that InstanceFCN does not work well with overlapping instances. Others [35, 36, 15] attempt to first perform segmentation and then form the desired instance masks by assembling the pixels of the same instance. Novotny et al. propose semi-convolutional operators to make FCNs applicable to instance segmentation. To our knowledge, thus far none of these methods can outperform Mask R-CNN in both accuracy and speed on the COCO benchmark dataset.
The recent YOLACT and BlendMask may be viewed as reformulations of Mask R-CNN, which decouple ROI detection and the feature maps used for mask prediction. Wang et al. developed a simple FCN-based instance segmentation method, showing competitive performance. PolarMask developed a new simple mask representation for instance segmentation, which extends the bounding box detector FCOS.
Panoptic segmentation. There are two main approaches for solving this task. The first one is the bottom-up approach. It first tackles the task as semantic segmentation and then uses clustering/grouping methods to assemble the pixels into individual instances or stuff.
The second approach is the top-down approach, which is often built on top-down instance segmentation methods. Panoptic-FPN extends Mask R-CNN with an additional semantic segmentation branch and combines its results with the instance segmentation results generated by Mask R-CNN. Moreover, attention-based methods have recently gained much popularity in many computer vision tasks, and they provide a new approach to panoptic segmentation. Axial-DeepLab used a carefully designed module to enable attention to be applied to large-size images for panoptic segmentation. CondInst can easily be applied to panoptic segmentation following the top-down approaches. We empirically observe that the quality of the instance segmentation results may be the dominant factor in the final performance. Thus in CondInst, without bells and whistles, by simply applying the same method used by Panoptic-FPN, the panoptic segmentation performance of CondInst is already competitive with the state-of-the-art panoptic segmentation methods.
The idea of AdaptIS shares some similarity with CondInst in that information about an instance is encoded in the coefficients generated by FiLM. Since only the batch normalization coefficients are dynamically generated, AdaptIS needs a large mask head to achieve good performance. In contrast, CondInst directly encodes the instance information into the conv. filters of the mask head, which is much more straightforward and efficient. Also, as shown in experiments, CondInst can achieve much better panoptic segmentation accuracy than AdaptIS, which suggests that CondInst is much more effective.
2 Our Methods: Instance and Panoptic Segmentation with CondInst
We first present CondInst for instance segmentation. The instance segmentation framework can be extended to solve panoptic segmentation in Sec. 2.5.
2.1 Overall Architecture for Instance Segmentation
Given an input image $I \in \mathbb{R}^{H \times W \times 3}$, the goal of instance segmentation is to predict the pixel-level mask and the category of each instance of interest in the image. The ground-truths are defined as $\{(M_i, c_i)\}$, where $M_i \in \{0, 1\}^{H \times W}$ is the mask for the $i$-th instance and $c_i \in \{1, 2, ..., C\}$ is the category. $C$ is 80 on MS-COCO. In semantic segmentation, the prediction target of each pixel is well-defined, being the semantic category of the pixel. In addition, the number of categories is known and fixed. Thus, the outputs of semantic segmentation can be easily represented with the output feature maps of the FCNs, and each channel of the output feature maps corresponds to a class. However, in instance segmentation, the prediction target of each pixel is hard to define because instance segmentation also requires distinguishing individual instances, and the number of instances varies from image to image. This poses a major challenge when applying traditional FCNs to instance segmentation.
In this work, our core idea is that for an image with $K$ instances, $K$ different mask heads will be dynamically generated, and each mask head will contain the characteristics of its target instance in its filters. As a result, when the mask head is applied to an input, it will only fire on the pixels of the instance, thus producing the mask prediction of the instance and distinguishing individual instances. We illustrate the process in Fig. 1. The instance-aware filters are generated by modifying an object detector. Specifically, we add a new controller branch to generate the filters for the target instance of each box predicted by the detector, as shown in Fig. 3. Therefore, the number of the dynamic mask heads is the same as the number of the predicted boxes, which should be the number of the instances in the image if the detector works well. In this work, we build CondInst on the popular object detector FCOS due to its simplicity and flexibility. The elimination of anchor-boxes in FCOS also reduces the number of parameters and the amount of computation.
As shown in Fig. 3, following FCOS, we make use of the feature maps $\{P_3, P_4, P_5, P_6, P_7\}$ of feature pyramid networks (FPNs), whose down-sampling ratios are 8, 16, 32, 64 and 128, respectively. On each feature level of the FPN, some functional layers (in the dashed box) are applied to make instance-aware predictions, for example, the class of the target instance and the dynamically-generated filters for the instance. In this sense, CondInst can be viewed as similar to Mask R-CNN: both first attend to instances in an image and then predict the pixel-level masks of the instances (i.e., instance-first).
Moreover, recall that Mask R-CNN employs an object detector to predict the bounding-boxes of the instances in the input image. The bounding-boxes are actually the way that Mask R-CNN represents instances. Similarly, CondInst employs the instance-aware filters to represent the instances. In other words, instead of encoding the instance information with the bounding-boxes, CondInst implicitly encodes it with the parameters of the dynamic mask heads, which is much more flexible. For example, it can easily represent the irregular shapes that are hard to be tightly enclosed by a bounding-box. This is one of CondInst’s advantages over the previous ROI-based methods.
Besides the detector, as shown in Fig. 3, there is also a bottom branch, which provides the feature maps (denoted by $F_{mask}$) that our generated mask heads take as inputs to predict the desired instance mask. The bottom branch aggregates the FPN feature maps $P_3$, $P_4$ and $P_5$. To be specific, $P_4$ and $P_5$ are upsampled to the resolution of $P_3$ with bilinear interpolation and added to $P_3$. After that, four $3 \times 3$ convolutions with 128 channels are applied. The resolution of the resulting feature maps is the same as $P_3$ (i.e., $1/8$ of the input image resolution). Finally, another convolutional layer is used to reduce the number of the output channels $C_{mask}$ from 128 to 8, which reduces the number of the generated parameters. Surprisingly, $C_{mask} = 8$ can already achieve good performance, and as shown in our experiments, a larger $C_{mask}$ here (e.g., 16) cannot improve the performance. Even more aggressively, using $C_{mask} = 2$ only slightly degrades the performance in mask AP. $F_{mask}$ is shared between all the instances.
Moreover, as mentioned before, the generated filters also encode the shape and position of the target instance. Since the feature maps $F_{mask}$ do not generally convey the position information, a map of the coordinates needs to be appended to $F_{mask}$ such that the generated filters are aware of positions. As the filters are generated with location-agnostic convolutions, they can only (implicitly) encode the shape and position with the coordinates relative to the location where the filters are generated (i.e., using the coordinate system with the location as the origin). Thus, as shown in Fig. 3, $F_{mask}$ is combined with a map of the relative coordinates, which are obtained by transforming all the locations on $F_{mask}$ to the coordinate system with the location generating the filters as the origin. Then, the combination is sent to the mask head to predict the instance mask in a fully convolutional fashion. The relative coordinates provide a strong cue for predicting the instance mask, as shown in our experiments. It is also interesting to note that even if the generated mask heads only take as input the map of the relative coordinates, a modest performance can be obtained, as shown in the experiments. This empirically shows that the generated filters indeed encode the shape and position of the target instance. Finally, sigmoid is used as the last layer of the mask head to obtain the mask scores. The mask head only classifies the pixels as foreground or background. The class of the instance is predicted by the classification head in the detector, as shown in Fig. 3.
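The construction of the relative-coordinate map can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the use of cell centers as image-space coordinates, and the absence of any normalization of the offsets are all assumptions made for the example.

```python
def relative_coord_map(height, width, stride, origin_xy):
    """Return two H x W maps (x offsets, y offsets) relative to origin_xy.

    Each feature-map location (col, row) is mapped to image coordinates
    (assumed here to be its cell center) and then shifted so that
    origin_xy, the image-space location that generated the filters,
    becomes (0, 0).
    """
    ox, oy = origin_xy
    half = stride // 2
    rel_x = [[(col * stride + half) - ox for col in range(width)]
             for _ in range(height)]
    rel_y = [[(row * stride + half) - oy for _ in range(width)]
             for row in range(height)]
    return rel_x, rel_y


def append_coords(features, rel_x, rel_y):
    """Stack the two coordinate maps onto a C x H x W nested list,
    giving the (C + 2) x H x W input of one instance's mask head."""
    return features + [rel_x, rel_y]


# Hypothetical 8-channel, 4 x 6 feature map at stride 8.
features = [[[0.0] * 6 for _ in range(4)] for _ in range(8)]
rel_x, rel_y = relative_coord_map(4, 6, 8, origin_xy=(20, 12))
mask_head_input = append_coords(features, rel_x, rel_y)
```

Because the offsets depend on the generating location, the shared feature maps stay the same for all instances while the two coordinate channels are rebuilt per instance.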
The resolution of the original mask prediction is the same as the resolution of $F_{mask}$, which is $1/8$ of the input image resolution. In order to improve the resolution of instance masks, we use bilinear interpolation to upsample the mask prediction by , resulting in instance masks (if the input image size is ). The mask's resolution is much higher than that of Mask R-CNN (only $28 \times 28$ as mentioned before).
2.2 Network Outputs and Training Targets
Similar to FCOS, each location on the FPN’s feature maps either is associated with an instance, thus being a positive sample, or is considered as a negative sample. The associated instance and label for each location are determined as follows.
Let us consider the feature maps $P_i \in \mathbb{R}^{H \times W \times C}$ and let $s$ be its down-sampling ratio. As shown in previous works [43, 39, 18], a location $(x, y)$ on the feature maps can be mapped back onto the input image as $(\lfloor s/2 \rfloor + xs, \lfloor s/2 \rfloor + ys)$. If the mapped location falls in the center region of an instance, the location is considered to be responsible for the instance. Any locations outside the center regions are labeled as negative samples. The center region is defined as the box $(c_x - rs, c_y - rs, c_x + rs, c_y + rs)$, where $(c_x, c_y)$ denotes the mass center of the instance mask, $s$ is the down-sampling ratio of $P_i$ and $r$ is a constant scalar being 1.5 as in FCOS. As shown in Fig. 3, at a location $(x, y)$ on $P_i$, CondInst has the following output heads.
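The positive/negative assignment described above can be sketched as below. This is an illustrative simplification under stated assumptions: the helper names are hypothetical, the default `radius=1.5` is assumed from FCOS's center-sampling setting, and the per-level size constraints and the handling of locations that fall into several instances are omitted.

```python
def map_to_image(x, y, stride):
    """Map feature-map location (x, y) back onto the input image."""
    return (stride // 2 + x * stride, stride // 2 + y * stride)


def mass_center(mask):
    """Mass center (c_x, c_y) of a binary mask given as a list of rows."""
    xs = ys = count = 0
    for row_idx, row in enumerate(mask):
        for col_idx, value in enumerate(row):
            if value:
                xs += col_idx
                ys += row_idx
                count += 1
    if count == 0:
        raise ValueError("empty mask has no mass center")
    return xs / count, ys / count


def is_positive(x, y, stride, mask, radius=1.5):
    """True if the mapped location falls inside the center region, i.e.
    the box of half-size radius * stride around the mask's mass center."""
    px, py = map_to_image(x, y, stride)
    cx, cy = mass_center(mask)
    half = radius * stride
    return cx - half <= px <= cx + half and cy - half <= py <= cy + half


# Hypothetical 64 x 64 image with a 32 x 32 square instance in the middle.
mask = [[1 if 16 <= r < 48 and 16 <= c < 48 else 0 for c in range(64)]
        for r in range(64)]
labels = [[is_positive(x, y, 8, mask) for x in range(8)] for y in range(8)]
```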
Classification Head. The classification head predicts the class of the instance associated with the location. The ground-truth target is the instance's class $c_i$ or 0 (i.e., background). As in FCOS, the network predicts a $C$-D vector $\mathbf{p}_{x,y}$ for the classification, and each element in $\mathbf{p}_{x,y}$ corresponds to a binary classifier, where $C$ is the number of categories.
Controller Head. The controller head, which has the same architecture as the classification head, is used to predict the parameters of the conv. filters of the mask head for the instance at the location. The mask head predicts the mask for this particular instance. This is the core contribution of our work. To predict the parameters, we concatenate all the parameters of the filters (i.e., weights and biases) together as an $N$-D vector $\theta_{x,y}$, where $N$ is the total number of the parameters. Accordingly, the controller head has $N$ output channels. The mask head is a very compact FCN architecture, which has three $1 \times 1$ convolutions, each having 8 channels and using ReLU as the activation function except for the last one. No normalization layer such as batch normalization is used here. The last layer has 1 output channel and uses sigmoid to predict the probability of being foreground. The mask head has 169 parameters in total (#weights $= (8 + 2) \times 8$ (conv1) $+\ 8 \times 8$ (conv2) $+\ 8 \times 1$ (conv3) and #biases $= 8$ (conv1) $+\ 8$ (conv2) $+\ 1$ (conv3)). The masks predicted by the mask heads are trained with the ground-truth instance masks, which pushes the controller to generate the correct filters.
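To make the role of the flat parameter vector concrete, the following sketch counts the parameters of a compact mask head, slices a controller output into per-layer weights and biases, and applies the resulting head to one pixel. It is not the authors' code: the layer sizes ($1 \times 1$ convolutions, width 8, and 8 + 2 input channels for features plus relative coordinates) and the per-layer "weights then biases" ordering inside the vector are assumptions chosen for illustration.

```python
import math


def layer_shapes(in_channels=10, width=8, num_layers=3):
    """(in, out) channel pairs for a stack of 1x1 convs ending in 1 channel."""
    shapes, c_in = [], in_channels
    for i in range(num_layers):
        c_out = 1 if i == num_layers - 1 else width
        shapes.append((c_in, c_out))
        c_in = c_out
    return shapes


def num_params(shapes):
    """Total weights + biases, i.e. the controller's output size N."""
    return sum(c_in * c_out + c_out for c_in, c_out in shapes)


def split_params(theta, shapes):
    """Slice a flat N-D vector into per-layer (weights, biases)."""
    if len(theta) != num_params(shapes):
        raise ValueError("theta has the wrong length")
    layers, pos = [], 0
    for c_in, c_out in shapes:
        weights = [theta[pos + o * c_in: pos + (o + 1) * c_in]
                   for o in range(c_out)]
        pos += c_in * c_out
        biases = theta[pos: pos + c_out]
        pos += c_out
        layers.append((weights, biases))
    return layers


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mask_head_pixel(inputs, layers):
    """Apply the 1x1 conv stack to one pixel's channel vector."""
    x = list(inputs)
    for idx, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]   # ReLU on all but the last layer
    return sigmoid(x[0])                   # foreground probability


shapes = layer_shapes()
theta = [0.0] * num_params(shapes)         # stand-in for a controller output
probability = mask_head_pixel([1.0] * 10, split_params(theta, shapes))
```

Since a $1 \times 1$ convolution acts on each pixel independently, running `mask_head_pixel` over every location of the feature maps is equivalent to applying the head in a fully convolutional fashion.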
Box Head. The box head is the same as that in FCOS, which predicts a 4-D vector encoding the four distances from the location to the four boundaries of the bounding-box of the target instance. Conceptually, CondInst can eliminate the box head since CondInst needs no ROIs. However, we find that if we make use of box-based NMS, the inference time will be much reduced. Thus, we still predict boxes in CondInst. We would like to highlight that the predicted boxes are only used in NMS and do not involve any ROI operations. Moreover, as shown in Table V, the box prediction can be removed if no box information is used (e.g., mask NMS ). This is fundamentally different from previous ROI-based methods, in which the box prediction is mandatory.
Center-ness Head. Like FCOS, at each location, we also predict a center-ness score. The center-ness score depicts how the location deviates from the center of the target instance. In inference, it is used to down-weight the boxes predicted by the locations far from the center, which might be unreliable. We refer readers to FCOS  for the details.
2.3 Loss Function
Formally, the overall loss function of CondInst can be formulated as,

$$L_{overall} = L_{fcos} + \lambda L_{mask}, \qquad (1)$$

where $L_{fcos}$ and $L_{mask}$ denote the original loss of FCOS and the loss for instance masks, respectively. $\lambda$, being 1 in this work, is used to balance the two losses. $L_{fcos}$ is the same as in FCOS. Specifically, $L_{fcos}$ includes the losses of the classification head, the box regression head and the center-ness head, which are trained with the focal loss, the GIoU loss, and the binary cross-entropy (BCE) loss, respectively. $L_{mask}$ is defined as,

$$L_{mask}(\{\theta_{x,y}\}) = \frac{1}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}} \, L_{dice}\big(MaskHead(\tilde{F}_{x,y}; \theta_{x,y}), M^*_{x,y}\big), \qquad (2)$$

where $c^*_{x,y}$ is the classification label of location $(x, y)$, which is the class of the instance associated with the location or 0 (i.e., background) if the location is not associated with any instance. $N_{pos}$ is the number of locations where $c^*_{x,y} > 0$. $\mathbb{1}_{\{c^*_{x,y} > 0\}}$ is the indicator function, being 1 if $c^*_{x,y} > 0$ and 0 otherwise. $\theta_{x,y}$ is the generated filters' parameters at location $(x, y)$. $\tilde{F}_{x,y}$ is the combination of $F_{mask}$ and a map of coordinates $O_{x,y}$. As described before, $O_{x,y}$ is the relative coordinates from all the locations on $F_{mask}$ to $(x, y)$ (i.e., the location where the filters are generated). $MaskHead$ denotes the mask head, which consists of a stack of convolutions with dynamic parameters $\theta_{x,y}$. $M^*_{x,y}$ is the mask of the instance associated with location $(x, y)$.
Here $L_{dice}$ is the Dice loss, which is used to overcome the foreground-background sample imbalance. We do not employ focal loss here as it requires special initialization, which is not trivial if the parameters are dynamically generated. Note that, in order to compute the loss between the predicted mask and the ground-truth mask $M^*_{x,y}$, they are required to have the same size. As mentioned before, the resolution of the predicted mask is of the ground-truth mask $M^*_{x,y}$. Thus, we down-sample $M^*_{x,y}$ by to make the sizes equal. These operations are omitted in Eq. (2) for clarity.
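As a rough illustration of the mask loss, the sketch below averages a Dice loss over the positive locations only. The specific Dice form (squared terms in the denominator and a small `eps`), the flattened-list mask representation, and the function names are assumptions for this example; the exact variant used in the paper may differ.

```python
def dice_loss(pred, target, eps=1e-5):
    """Dice loss between a flattened predicted mask (probabilities in
    [0, 1]) and a flattened binary ground-truth mask of the same size."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same size")
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - 2.0 * intersection / (denominator + eps)


def mask_loss(samples):
    """Average Dice loss over positive locations.

    samples: list of (class_label, predicted_mask, target_mask); a label
    of 0 marks a background location, which contributes nothing.
    """
    positives = [(p, t) for label, p, t in samples if label > 0]
    if not positives:
        return 0.0
    return sum(dice_loss(p, t) for p, t in positives) / len(positives)


perfect = dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])   # close to 0
disjoint = dice_loss([0.0, 1.0], [1, 0])                   # equals 1
```

Because the loss is a ratio of overlaps rather than a per-pixel sum, a small foreground region is not swamped by the many background pixels, which is the imbalance the text refers to.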
By design, all the positive locations on the feature maps should be used to compute the mask loss. For the images having hundreds of positive locations, the model would consume a large amount of memory. Therefore, in our preliminary version , the positive locations used in computing the mask loss are limited up to 500 per GPU (i.e., 250 per image and we have two images on one GPU). If there are more than 500 positive locations, 500 locations will be randomly chosen. In this version, instead of randomly choosing the 500 locations, we first rank the locations by the scores predicted by the FCOS detector, and then choose the locations with top scores for each instance. As a result, the number of locations per image can be reduced to . This strategy works equally well and further reduces the memory footprint. For instance, using this strategy, the ResNet-50 based CondInst can be trained with 4 1080Ti GPUs.
Moreover, as shown in YOLACT and BlendMask, during training, the instance segmentation task can benefit from a joint semantic segmentation task. Thus, we also conduct experiments with the joint semantic segmentation task, showing improved performance. However, unless explicitly specified, all the experiments in the paper are without the semantic segmentation task. If used, the semantic segmentation loss is added to the overall loss.
2.4 Inference
Given an input image, we forward it through the network to obtain the outputs including the classification confidence $\mathbf{p}_{x,y}$, center-ness scores, box prediction $\mathbf{t}_{x,y}$ and the generated parameters $\theta_{x,y}$. We first follow the steps in FCOS to obtain the box detections. Afterwards, box-based NMS with the threshold being 0.6 is used to remove duplicated detections, and then the top 100 boxes are used to compute masks. Note that these boxes are also associated with the filters generated by the controller. Let us assume that $K$ boxes remain after the NMS, and thus we have $K$ groups of the generated filters. The $K$ groups of filters are used to produce $K$ instance-specific mask heads. These instance-specific mask heads are applied, in the fashion of FCNs, to $\tilde{F}_{x,y}$ (i.e., the combination of $F_{mask}$ and $O_{x,y}$) to predict the masks of the instances. Since the mask head is a very compact network (three $1 \times 1$ convolutions with 8 channels and 169 parameters in total), the overhead of computing masks is extremely small. For example, even with 100 detections (i.e., the maximum number of detections per image on MS-COCO), only a few milliseconds in total are spent on the mask heads, which only adds about 10% computational time to the base detector FCOS. In contrast, the mask head of Mask R-CNN has four convolutions with 256 channels, thus having more than 2.3M parameters and taking longer computational time.
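The detection-selection step can be sketched with a plain greedy NMS. This is a simplified stand-in, not the FCOS post-processing itself: it is class-agnostic, the function names are hypothetical, and the defaults `iou_threshold=0.6` and `top_k=100` are assumptions for the example. The returned indices identify which generated filter groups are turned into mask heads.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_detections(boxes, scores, iou_threshold=0.6, top_k=100):
    """Greedy NMS by descending score, keeping at most top_k detections.

    Returns the indices of the surviving boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if len(keep) >= top_k:
            break
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = select_detections(boxes, scores)
# The filters generated at the kept locations become the mask heads,
# e.g. thetas_kept = [thetas[i] for i in kept] for a hypothetical list thetas.
```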
2.5 Extension to Panoptic Segmentation
Since panoptic segmentation can be treated as a combination of instance and semantic segmentation, we keep CondInst as is to obtain the instance segmentation results. For the semantic segmentation, we use the structure from Panoptic-FPN . To be specific, as shown in Fig. 4, the semantic segmentation branch takes as inputs the feature maps of FPNs. are up-sampled to the same resolution as and the four feature maps are concatenated together. The resolution of is of the input image, which is also the same as the instance masks predicted by CondInst. Then, it is followed by a convolution and to obtain the classification scores. The classification scores are trained with the cross-entropy loss and the loss is added to with loss weight .
Inference. During inference, we follow  to combine instance and semantic results for obtaining the final panoptic results. The instance results from CondInst are ranked by their confidence scores generated by FCOS. The results with their scores less than are discarded. When overlaps occur between the instance masks, the overlap areas are attributed to the instance with higher score. Then, the instance that loses more than 40% of its total area due to its overlap with other higher-score-instances is discarded. Finally, the semantic results are filled to the areas that are not occupied by any instance.
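The merging heuristic described above can be sketched as follows. It is an illustration under assumptions rather than the exact procedure of the cited work: masks are represented as sets of pixel coordinates, the function and argument names are hypothetical, the score threshold is left to the caller because its value is not given here, and a discarded instance is assumed not to occupy any pixels.

```python
def merge_panoptic(instances, semantic, score_threshold,
                   max_lost_fraction=0.4):
    """Combine instance and semantic results into one panoptic map.

    instances: list of (score, set_of_pixels, instance_id)
    semantic:  dict mapping pixel -> stuff label
    Returns a dict mapping pixel -> ("thing", id) or ("stuff", label).
    """
    panoptic = {}
    ranked = sorted(instances, key=lambda inst: inst[0], reverse=True)
    for score, pixels, inst_id in ranked:
        if score < score_threshold or not pixels:
            continue                       # low-confidence result
        # Pixels already taken belong to higher-score instances.
        free = {p for p in pixels if p not in panoptic}
        lost = 1.0 - len(free) / len(pixels)
        if lost > max_lost_fraction:
            continue                       # lost too much of its area
        for p in free:
            panoptic[p] = ("thing", inst_id)
    # Semantic results fill whatever no instance occupies.
    for p, label in semantic.items():
        panoptic.setdefault(p, ("stuff", label))
    return panoptic


instances = [
    (0.9, {(0, 0), (0, 1)}, 1),
    (0.8, {(0, 1), (0, 2)}, 2),            # loses half its area to id 1
    (0.7, {(0, 2), (0, 3), (0, 4)}, 3),
    (0.1, {(1, 0)}, 4),                    # below the score threshold
]
semantic = {(r, c): "sky" for r in range(2) for c in range(5)}
result = merge_panoptic(instances, semantic, score_threshold=0.3)
```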
3 Experiments
We evaluate CondInst on the large-scale benchmark MS-COCO. Following the common practice [17, 43, 27], our models are trained on the train2017 split (115K images) and all the ablation experiments are evaluated on the val2017 split (5K images). Our main results are reported on the test-dev split (20K images).
3.1 Implementation Details
Unless specified, we make use of the following implementation details. Following FCOS, ResNet-50 is used as our backbone network and the weights pre-trained on ImageNet are used to initialize it. The newly added layers are initialized as in FCOS. Our models are trained with stochastic gradient descent (SGD) over 8 V100 GPUs for 90K iterations with the initial learning rate being 0.01 and a mini-batch of 16 images. The learning rate is reduced by a factor of 10 at iteration 60K and 80K, respectively. Weight decay and momentum are set as 0.0001 and 0.9, respectively. Following Detectron2, the input images are resized to have their shorter sides in $[640, 800]$ and their longer sides less than or equal to 1333 during training. Left-right flipping data augmentation is also used during training. When testing, we do not use any data augmentation and only the scale of the shorter side being 800 is used. The inference time in this work is measured on a single V100 GPU with 1 image per batch.
| w/ abs. coord. | w/ rel. coord. | w/ $F_{mask}$ | AP | AP$_{50}$ | AP$_{75}$ | AP$_S$ | AP$_M$ | AP$_L$ | AR$_1$ | AR$_{10}$ | AR$_{100}$ |
3.2 Architectures of the Mask Head
In this section, we discuss the design choices of the mask head in CondInst. To our surprise, the performance is not sensitive to the architecture of the mask head. Our baseline is the mask head of three convolutions with 8 channels (i.e., width 8). As shown in Table I (3rd row), it achieves in mask AP. Next, we conduct experiments by varying the depth of the mask head. As shown in Table I(a), apart from the mask head with depth being 1, all other mask heads (i.e., depth and ) attain similar performance. The mask head with depth being 1 achieves inferior performance, as in this case the mask head is actually a linear mapping, which has overly weak capacity and cannot encode the complex shapes of the instances. Moreover, as shown in Table I(b), varying the width (i.e., the number of the channels) does not result in a remarkable performance change either, as long as the width is in a reasonable range. We also note that our mask head is extremely light-weight as the filters in our mask head are dynamically generated. As shown in Table I, our baseline mask head only takes ms per 100 instances (the maximum number of instances on MS-COCO), which suggests that our mask head only adds small computational overhead to the base detector. Moreover, our baseline mask head only has 169 parameters in total. In sharp contrast, the mask head of Mask R-CNN has more than 2.3M parameters and takes longer computational time ( ms per 100 instances).
3.3 Design Choices of the Mask Branch
We further investigate the impact of the mask branch. We first change $C_{mask}$, which is the number of channels of the mask branch's output feature maps (i.e., $F_{mask}$). As shown in Table II, as long as $C_{mask}$ is in a reasonable range (i.e., from to ), the performance stays almost the same. $C_{mask} = 8$ is optimal and thus we use $C_{mask} = 8$ in all other experiments by default.
As mentioned before, before being taken as the input of the mask heads, the mask branch's output is concatenated with a map of relative coordinates, which provides a strong cue for the mask prediction. As shown in Table III (2nd row), the performance drops significantly if the relative coordinates are removed ( vs. ). The significant performance drop implies that the generated filters not only encode the appearance cues but also encode the shape and relative position of the target instance. This is also evidenced by the experiment using only the relative coordinates. As shown in Table III (3rd row), using only the relative coordinates can also obtain decent performance ( in mask AP). We would like to highlight that unlike Mask R-CNN, which is based on Faster R-CNN and represents the target instance by an axis-aligned RoI, CondInst implicitly encodes the shape into the generated filters, which can easily represent any shape, including irregular ones, and thus is much more flexible. We also experiment with the absolute coordinates, but they do not substantially improve the performance, as shown in Table III (). This suggests that the generated filters mainly carry translation-invariant cues such as shapes and relative position, which is preferable.
3.4 How Important to Upsample Mask Predictions?
As mentioned before, the original mask prediction is upsampled, and the upsampling is of great importance to the final performance. We confirm this in the experiment. As shown in Table IV, without the upsampling (1st row in the table), CondInst produces the mask prediction at $1/8$ of the input image resolution, which merely achieves in mask AP because most of the details (e.g., the boundary) are lost. If the mask prediction is upsampled by factor , the performance can be significantly improved by in mask AP (from to ). In particular, the improvement on small objects is large (from to ), which suggests that the upsampling can greatly retain the details of objects. Increasing the upsampling factor to slightly worsens the performance in some metrics (e.g., AP), probably due to the relatively low-quality annotations of MS-COCO. Therefore, we use factor in all other models.
3.5 CondInst without Bounding-box Detection
Although we still keep the bounding-box detection branch in CondInst, it is conceptually feasible to eliminate it entirely if we use an NMS that does not rely on bounding boxes. In this case, all the foreground samples (determined by the classification head) are used to compute instance masks, and duplicated masks are removed by mask-based NMS. This is confirmed in Table V: with the mask-based NMS, the same overall performance is obtained as with box-based NMS ( vs. in mask AP).
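The mask-based NMS mentioned above can be sketched as a greedy procedure that replaces box IoU with mask IoU. This is a minimal sketch under stated assumptions: masks are represented as sets of foreground pixel coordinates, the IoU threshold of 0.5 is a placeholder, and per-category handling is omitted; none of these details are specified in the text.

```python
def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as sets of (row, col) foreground pixels."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def mask_nms(masks, scores, iou_threshold=0.5):
    """Greedy NMS on masks: visit by descending score, drop near-duplicates."""
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    keep = []
    for idx in order:
        if all(mask_iou(masks[idx], masks[k]) <= iou_threshold for k in keep):
            keep.append(idx)
    return keep


a = {(0, 0), (0, 1), (1, 0), (1, 1)}
b = {(0, 0), (0, 1), (1, 0)}   # near-duplicate of a (mask IoU 0.75)
c = {(5, 5), (5, 6)}           # a separate instance
kept = mask_nms([a, b, c], scores=[0.9, 0.8, 0.7])
```

In this toy case the lower-scoring duplicate `b` is suppressed while `a` and `c` are kept, without any bounding box being computed.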
3.6 Comparisons with State-of-the-art Methods
We compare CondInst against previous state-of-the-art methods on the MS-COCO test-dev split. As shown in Table VI, with learning rate schedule (i.e., iterations), CondInst outperforms the original Mask R-CNN by ( vs. ). CondInst is also much faster than the original Mask R-CNN (ms vs. ms per image on a single V100 GPU). To our knowledge, this is the first time that a new and simpler instance segmentation method, without any bells and whistles, outperforms Mask R-CNN in both accuracy and speed. CondInst also obtains better performance ( vs. ) at on-par speed (ms vs. ms) compared with the well-engineered Mask R-CNN in Detectron2 (i.e., Mask R-CNN in Table VI). Furthermore, with a longer training schedule (e.g., ) or a stronger backbone (e.g., ResNet-101), a consistent improvement is achieved as well ( vs. with ResNet-50 and vs. with ResNet-101). Moreover, as shown in Table VI, with the auxiliary semantic segmentation task, the performance can be boosted from to (ResNet-50) or from to (ResNet-101) without increasing the inference time. For fair comparison, all inference times here are measured by ourselves on the same hardware with the official code.
|Mask R-CNN ||R-50-FPN||34.6||56.5||36.6||15.4||36.3||49.7|
|BlendMask w/ sem. ||R-50-FPN||✓||37.0||58.9||39.7||17.3||39.4||52.5|
|CondInst w/ sem.||R-50-FPN||✓||38.6||60.2||41.4||20.6||41.0||51.1|
|BlendMask w/ sem.||R-101-FPN||✓||39.6||61.6||42.6||22.4||42.2||51.4|
|CondInst w/ sem.||R-101-FPN||✓||40.0||62.0||42.9||21.4||42.6||53.0|
|CondInst w/ sem.||R-101-BiFPN||✓||40.5||62.4||43.4||21.8||43.3||53.3|
|CondInst w/ sem.||R-101-DCN-BiFPN||✓||41.3||63.3||44.4||22.5||43.9||55.2|
We also compare CondInst with recently proposed instance segmentation methods. With only half the training iterations, CondInst surpasses TensorMask by a large margin ( vs. for ResNet-50 and vs. for ResNet-101). CondInst is also faster than TensorMask (ms vs. ms per image on the same GPU) with similar performance ( vs. ). Moreover, CondInst outperforms YOLACT-700 by a large margin with the same ResNet-101 backbone ( vs. , both with the auxiliary semantic segmentation task). Finally, as shown in Fig. 2, compared with YOLACT-700 and Mask R-CNN, CondInst preserves more details and produces higher-quality instance segmentation results.
3.7 Real-time Instance Segmentation with CondInst
We also present a real-time version of CondInst. Following FCOS, the conv. layers in the classification and box regression towers of FCOS are shared in the real-time models (denoted by “shtw.” in Table VII). Moreover, we reduce the input image scale from 800 to 512 during testing, and the FPN levels and are removed, since few large objects remain at the smaller input size. To compensate for the performance loss due to the smaller input size, we use a more aggressive training strategy. Specifically, the real-time models are trained for iterations (i.e., ), and the shorter side of the input image is randomly chosen from the range 256 to 608 with step 32. Synchronized BatchNorm (SyncBN) is also used during training. In the real-time models, following YOLACT, we enable the extra semantic segmentation loss by default.
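The multi-scale training rule above (shorter side drawn from 256 to 608 with step 32) can be sketched as follows. The helper names are hypothetical, and the aspect-preserving resize with rounding is an assumption about how the sampled size is applied; any cap on the longer side is omitted.

```python
import random


def shorter_side_candidates(low=256, high=608, step=32):
    """All admissible shorter-side lengths, with both endpoints included."""
    return list(range(low, high + 1, step))


def sample_shorter_side(rng=random):
    """Draw one shorter-side length uniformly at random for a training image."""
    return rng.choice(shorter_side_candidates())


def resize_keep_aspect(width, height, shorter_side):
    """New (width, height) after scaling so the shorter side matches the target."""
    min_side = min(width, height)
    new_w = round(width * shorter_side / min_side)
    new_h = round(height * shorter_side / min_side)
    return new_w, new_h


candidates = shorter_side_candidates()
train_size = resize_keep_aspect(640, 480, sample_shorter_side())
test_size = resize_keep_aspect(640, 480, 512)  # fixed scale of 512 at test time
```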
The performance and inference speed of these real-time models are shown in Table VII. As shown in the table, the R-50 based CondInst-RT outperforms the R-50 based YOLACT++ by about AP ( vs. ) at almost the same speed (43 FPS vs. 44 FPS). By further using the strong backbone DLA-34, CondInst-RT can achieve 47 FPS with similar performance. Furthermore, if we do not share the classification and box regression towers in FCOS, the performance can be improved to AP with a slightly longer inference time.
3.8 Instance Segmentation on Cityscapes
|method||backbone||training data||AP [val]||AP||AP50||person||rider||car||truck||bus||train||mcycle||bicycle|
|CondInst w/ sem.||ResNet-50-FPN||train||33.9||28.6||53.1||31.3||24.2||51.9||21.2||35.9||26.5||20.9||17.0|
Table VIII: Instance segmentation results on Cityscapes val (“AP [val]” column) and test (remaining columns) splits. “DCN”: using deformable convolutions in the backbones. “+COCO”: fine-tuning from the models pre-trained on COCO. “train+val+COCO”: using both train and val splits to train the models evaluated on the test split. “w/ sem.”: using the auxiliary semantic segmentation loss during training as in COCO.
We also conduct instance segmentation experiments on Cityscapes. The Cityscapes dataset is designed for the understanding of urban street scenes. For instance segmentation, it has 8 categories: person, rider, car, truck, bus, train, motorcycle, and bicycle. It includes 2975, 500 and 1525 images with fine annotations for training, validation and testing, respectively. It also has 20K training images with coarse annotations. Following Mask R-CNN, we only use the images with fine annotations to train our models. All images have the same resolution of 2048×1024. The performance on Cityscapes is also measured with the COCO-style mask AP, which is the mask AP averaged over ten IoU thresholds from to .
We follow the training details in Detectron2 to train CondInst on Cityscapes. Specifically, the models are trained for 24K iterations with batch size 8 (1 image per GPU). The initial learning rate is , which is reduced by a factor of 10 at step 18K. Since Cityscapes has relatively few images, following Mask R-CNN, we may initialize the models with weights pre-trained on the COCO dataset, if specified. Moreover, we use multi-scale data augmentation during training, with the shorter side of the images sampled from the range 800 to 1024 with step 32. At inference, we only use the original image scale of 2048×1024. Additionally, in order to preserve more details on Cityscapes, we increase the mask output resolution of CondInst from to resolution of the input image.
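The step schedule described above (24K iterations, learning rate divided by 10 at step 18K) can be written as a small function. This is a sketch only: `base_lr` is left as a parameter because the initial value is not reproduced here, the placeholder value in the usage lines is purely illustrative, any warm-up phase is omitted, and applying the decay exactly at iteration 18000 is an assumed convention.

```python
def learning_rate(iteration, base_lr, decay_steps=(18_000,), gamma=0.1):
    """Step schedule: multiply the rate by gamma once each decay step is reached."""
    lr = base_lr
    for step in decay_steps:
        if iteration >= step:
            lr *= gamma
    return lr


MAX_ITER = 24_000
PLACEHOLDER_BASE_LR = 1.0  # hypothetical value, chosen only to show the ratio

lr_start = learning_rate(0, PLACEHOLDER_BASE_LR)
lr_before_decay = learning_rate(17_999, PLACEHOLDER_BASE_LR)
lr_after_decay = learning_rate(18_000, PLACEHOLDER_BASE_LR)
lr_end = learning_rate(MAX_ITER - 1, PLACEHOLDER_BASE_LR)
```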
The results are reported in Table VIII. As shown in the table, with the same settings, CondInst generally outperforms the previous strong baseline Mask R-CNN by more than mask AP in all the experiments. On Cityscapes, the auxiliary semantic segmentation loss can also improve the instance segmentation performance. The results with the loss are denoted by “w/ sem.” in Table VIII. By further using the complementary techniques such as deformable convolutions and BiFPN, the performance can be further boosted as expected. Some qualitative results are shown in Fig. 5.
3.9 Experiments on Panoptic Segmentation
|DeeperLab ||Xception-71 ||-||34.3||37.5||29.6|
As mentioned before, CondInst can be easily extended to panoptic segmentation by attaching a new semantic segmentation branch, as depicted in Fig. 4. Here, we conduct the panoptic segmentation experiments on the COCO 2018 dataset. Unless specified, the training and testing details (e.g., image sizes, the number of iterations, etc.) are the same as in the instance segmentation task on COCO.
Although panoptic segmentation can be viewed as a combination of instance segmentation and semantic segmentation, there is a discrepancy between the targets of the original instance segmentation task and those of the instance segmentation task within panoptic segmentation. Panoptic segmentation requires that each pixel in the resulting mask has only one label. Therefore, if two instances overlap, the pixels in the overlapped region are assigned only to the front instance. In the original instance segmentation, however, the pixels in the overlapped region belong to both instances, and the ground-truth masks are labeled accordingly. Therefore, when we use the instance segmentation framework for panoptic segmentation, the training targets of the instance segmentation are changed to the instance annotations in panoptic segmentation.
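The difference between the two kinds of targets can be illustrated with a small sketch that turns overlapping instance masks into panoptic-style, non-overlapping ones. It assumes the masks are given as sets of pixel coordinates and are already ordered front-most first; the function name and this ordering are hypothetical, and in practice the panoptic annotations themselves supply the non-overlapping targets.

```python
def resolve_overlaps(masks_front_to_back):
    """Give each contested pixel to the front-most instance that covers it."""
    occupied = set()
    resolved = []
    for mask in masks_front_to_back:
        visible = mask - occupied   # pixels not already claimed by a nearer instance
        resolved.append(visible)
        occupied |= visible
    return resolved


front = {(0, 0), (0, 1)}
back = {(0, 1), (0, 2)}             # shares pixel (0, 1) with the front instance
panoptic_targets = resolve_overlaps([front, back])
```

In the original instance segmentation targets, pixel `(0, 1)` would belong to both masks; after resolution it belongs only to the front instance.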
We compare our method with a few state-of-the-art panoptic segmentation methods in Table IX. On the challenging COCO test-dev benchmark, we outperform the previous strong baseline Panoptic-FPN by a large margin with the same backbone and training schedule (i.e., from to in PQ with ResNet101-FPN). Moreover, compared to AdaptIS, which shares some similarity with our method, the ResNet-101 based CondInst achieves dramatically better performance than the ResNeXt-101 based AdaptIS ( vs. PQ). This suggests that using dynamic filters here might be more effective than using FiLM. In addition, CondInst also considerably outperforms recent methods such as  and Panoptic-FCN. We also show some qualitative results in Fig. 6.
We have proposed a new and simple instance segmentation framework, termed CondInst. Unlike previous methods such as Mask R-CNN, which employ a mask head with fixed weights, CondInst conditions the mask head on instances and dynamically generates the filters of the mask head. This not only reduces the parameters and computational complexity of the mask head, but also eliminates the ROI operations, resulting in a faster and simpler instance segmentation framework. To our knowledge, CondInst is the first framework that can outperform Mask R-CNN in both accuracy and speed without needing longer training schedules. With simple modifications, CondInst can be extended to solve panoptic segmentation and achieve state-of-the-art performance on the challenging COCO dataset. We believe that CondInst can be a strong alternative for both instance and panoptic segmentation.
-  A. Paszke et al. PyTorch: An imperative style, high-performance deep learning library. In Proc. Advances in Neural Inf. Process. Syst., pages 8024–8035, 2019.
-  J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and I. Reid. Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Proc. Advances in Neural Inf. Process. Syst., pages 35–45, 2019.
-  D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee. YOLACT++: Better real-time instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 2019.
-  D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee. YOLACT: real-time instance segmentation. In Proc. IEEE Int. Conf. Comp. Vis., pages 9157–9166, 2019.
-  L. Boominathan, S. Kruthiventi, and R. V. Babu. Crowdnet: A deep convolutional network for dense crowd counting. In Proc. ACM Int. Conf. Multimedia, pages 640–644. ACM, 2016.
-  H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, and Y. Yan. Blendmask: Top-down meets bottom-up for instance segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2020.
-  K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al. Hybrid task cascade for instance segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 4974–4983, 2019.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834–848, 2017.
-  X. Chen, R. Girshick, K. He, and P. Dollár. Tensormask: A foundation for dense object segmentation. In Proc. IEEE Int. Conf. Comp. Vis., pages 2061–2069, 2019.
-  B. Cheng, M. D. Collins, Y. Zhu, T. Liu, T. S. Huang, H. Adam, and L.-C. Chen. Panoptic-deeplab. arXiv: Comp. Res. Repository, 2019.
-  F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1251–1258, 2017.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3213–3223, 2016.
-  J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In Proc. Eur. Conf. Comp. Vis., pages 534–549. Springer, 2016.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 248–255. IEEE, 2009.
-  A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy. Semantic instance segmentation via deep metric learning. arXiv: Comp. Res. Repository, 2017.
-  K. He, R. Girshick, and P. Dollár. Rethinking imagenet pre-training. In Proc. IEEE Int. Conf. Comp. Vis., pages 4918–4927, 2019.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proc. IEEE Int. Conf. Comp. Vis., pages 2961–2969, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell., 37(9):1904–1916, 2015.
-  T. He, C. Shen, Z. Tian, D. Gong, C. Sun, and Y. Yan. Knowledge adaptation for efficient semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 578–587, 2019.
-  Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang. Mask scoring R-CNN. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 6409–6418, 2019.
-  X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In Proc. Advances in Neural Inf. Process. Syst., pages 667–675, 2016.
-  A. Kirillov, R. Girshick, K. He, and P. Dollár. Panoptic feature pyramid networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 6399–6408, 2019.
-  A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár. Panoptic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 9404–9413, 2019.
-  Q. Li, X. Qi, and P. H. S. Torr. Unifying training and inference for panoptic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 13320–13328, 2020.
-  Y. Li, H. Zhao, X. Qi, L. Wang, Z. Li, J. Sun, and J. Jia. Fully convolutional networks for panoptic segmentation. arXiv: Comp. Res. Repository, 2020.
-  T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2117–2125, 2017.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2980–2988, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Proc. Eur. Conf. Comp. Vis., pages 740–755. Springer, 2014.
-  F. Liu, C. Shen, G. Lin, and I. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell., 2016.
-  S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 8759–8768, 2018.
-  Y. Liu, C. Shu, J. Wang, and C. Shen. Structured knowledge distillation for dense prediction. IEEE Trans. Pattern Anal. Mach. Intell., 2020.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3431–3440, 2015.
-  M. Abadi et al. TensorFlow: A system for large-scale machine learning. In USENIX Symp. Operating Systems Design & Implementation, pages 265–283, 2016.
-  F. Milletari, N. Navab, and S.-A. Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proc. Int. Conf. 3D Vision, pages 565–571. IEEE, 2016.
-  D. Neven, B. D. Brabandere, M. Proesmans, and L. V. Gool. Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 8837–8845, 2019.
-  A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping. In Proc. Advances in Neural Inf. Process. Syst., pages 2277–2287, 2017.
-  D. Novotny, S. Albanie, D. Larlus, and A. Vedaldi. Semi-convolutional operators for instance segmentation. In Proc. Eur. Conf. Comp. Vis., pages 86–102, 2018.
-  E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. FiLM: Visual reasoning with a general conditioning layer. In Proc. AAAI Conf. Artificial Intell., 2018.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proc. Advances in Neural Inf. Process. Syst., pages 91–99, 2015.
-  K. Sofiiuk, O. Barinova, and A. Konushin. Adaptis: Adaptive instance selection network. In Proc. IEEE Int. Conf. Comp. Vis., pages 7355–7363, 2019.
-  Z. Tian, T. He, C. Shen, and Y. Yan. Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3126–3135, 2019.
-  Z. Tian, C. Shen, and H. Chen. Conditional convolutions for instance segmentation. In Proc. Eur. Conf. Comp. Vis., 2020.
-  Z. Tian, C. Shen, H. Chen, and T. He. FCOS: Fully convolutional one-stage object detection. In Proc. IEEE Int. Conf. Comp. Vis., pages 9627–9636, 2019.
-  Z. Tian, C. Shen, H. Chen, and T. He. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell., 2021.
-  H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L.-C. Chen. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. arXiv preprint arXiv:2003.07853, 2020.
-  X. Wang, T. Kong, C. Shen, Y. Jiang, and L. Li. SOLO: Segmenting objects by locations. In Proc. Eur. Conf. Comp. Vis., 2020.
-  X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen. SOLOv2: Dynamic and fast instance segmentation. In Proc. Advances in Neural Inf. Process. Syst., 2020.
-  Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
-  E. Xie, P. Sun, X. Song, W. Wang, D. Liang, C. Shen, and P. Luo. PolarMask: Single shot instance segmentation with polar representation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2020.
-  Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer, and R. Urtasun. Upsnet: A unified panoptic segmentation network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 8818–8826, 2019.
-  B. Yang, G. Bender, Q. V. Le, and J. Ngiam. Condconv: Conditionally parameterized convolutions for efficient inference. In Proc. Advances in Neural Inf. Process. Syst., pages 1305–1316, 2019.
-  T.-J. Yang, M. D. Collins, Y. Zhu, J.-J. Hwang, T. Liu, X. Zhang, V. Sze, G. Papandreou, and L.-C. Chen. Deeperlab: Single-shot image parser. arXiv preprint arXiv:1902.05093, 2019.
-  W. Yin, Y. Liu, C. Shen, and Y. Yan. Enforcing geometric constraints of virtual normal for depth prediction. In Proc. IEEE Int. Conf. Comp. Vis., 2019.
-  W. Yin, X. Wang, C. Shen, Y. Liu, Z. Tian, S. Xu, C. Sun, and D. Renyin. Diversedepth: Affine-invariant depth prediction using diverse data. arXiv preprint arXiv:2002.00569, 2020.
-  F. Yu, D. Wang, E. Shelhamer, and T. Darrell. Deep layer aggregation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2403–2412, 2018.