Image segmentation aims at finding groups of coherent pixels Szeliski (2010). There are different notions of groups, such as semantic categories (e.g., car, dog, cat) or instances (e.g., objects that coexist in the same image). Based on the different segmentation targets, the tasks are termed differently, i.e., semantic and instance segmentation, respectively. Kirillov et al. (2019b) propose panoptic segmentation to join the two tasks as a coherent scene segmentation task.
Grouping pixels according to semantic categories can be formulated as a dense classification problem. As shown in Fig. 1-(a), recent methods directly learn a set of kernels (namely semantic kernels in this paper) of pre-defined categories and use them to classify pixels Long et al. (2015) or regions He et al. (2017). Such a framework is elegant and straightforward. However, extending this notion to instance segmentation is non-trivial given the varying number of instances across images. Consequently, instance segmentation is tackled by more complicated frameworks with additional steps such as object detection He et al. (2017) or embedding generation Newell et al. (2017). These methods either rely on extra components whose accuracy must be guaranteed to a reasonable extent, or demand complex post-processing such as Non-Maximum Suppression (NMS) and pixel grouping. Recent approaches Wang et al. (2020b); Tian et al. (2020); Li et al. (2021) generate kernels from dense feature grids and then select kernels for segmentation to simplify the frameworks. Nonetheless, since they build upon dense grids to enumerate and select kernels, these methods still rely on hand-crafted post-processing to eliminate masks or kernels of duplicated instances.
In this paper, we make the first attempt to formulate a unified and effective framework to bridge the seemingly different image segmentation tasks (semantic, instance, and panoptic) through the notion of kernels. Our method is dubbed K-Net (‘K’ stands for kernels). It begins with a set of kernels that are randomly initialized, and learns the kernels in accordance with the segmentation targets at hand, namely, semantic kernels for semantic categories and instance kernels for instance identities (Fig. 1-(b)). A simple combination of semantic kernels and instance kernels naturally enables panoptic segmentation (Fig. 1-(c)). In the forward pass, the kernels perform convolution on the image features to obtain the corresponding segmentation predictions.
The versatility and simplicity of K-Net are made possible through two novel designs. First, we formulate K-Net so that it dynamically updates the kernels to make them conditional on their activations on the image. Such a content-aware mechanism is crucial to ensure that each kernel, especially an instance kernel, responds accurately to varying objects in an image. By applying this adaptive kernel update strategy iteratively, K-Net significantly improves the discriminative ability of the kernels and boosts the final segmentation performance. It is noteworthy that this strategy universally applies to kernels for all the segmentation tasks.
Second, inspired by recent advances in object detection Carion et al. (2020), we adopt the bipartite matching strategy Stewart et al. (2016) to assign learning targets for each kernel. This training approach is advantageous over conventional training strategies Ren et al. (2015); Lin et al. (2017b) as it builds a one-to-one mapping between kernels and instances in an image. It thus resolves the problem of dealing with a varying number of instances in an image. In addition, it is purely mask-driven without involving boxes. Hence, K-Net is naturally NMS-free and box-free, which is appealing for real-time applications.
To show the effectiveness of the proposed unified framework on different segmentation tasks, we conduct extensive experiments on the COCO dataset Lin et al. (2014) for panoptic and instance segmentation, and the ADE20K dataset Zhou et al. (2019) for semantic segmentation. Without bells and whistles, K-Net surpasses all previous state-of-the-art single-model results on panoptic (52.1% PQ) and semantic segmentation benchmarks (54.3% mIoU), and achieves competitive instance segmentation performance compared to the more expensive Cascade Mask R-CNN Cai and Vasconcelos (2018). We further analyze the learned kernels and find that instance kernels tend to specialize in objects of similar sizes at specific locations.
2 Related Work
Semantic Segmentation. Contemporary semantic segmentation approaches typically build upon a fully convolutional network (FCN) Long et al. (2015) and treat the task as a dense classification problem. Based on this framework, many studies focus on enhancing the feature representation through dilated convolution Chen et al. (2018a, 2017, b), pyramid pooling Zhao et al. (2017); Xiao et al. (2018), context representations Yuan et al. (2020); Zhang et al. (2018), and attention mechanisms Zhao et al. (2018); Yin et al. (2020). Recently, SETR Zheng et al. (2021) reformulates the task as a sequence-to-sequence prediction task by using a vision transformer Dosovitskiy et al. (2021). Despite the different model architectures, the approaches above share the common notion of making predictions via static semantic kernels. Differently, the proposed K-Net makes the kernels dynamic and conditional on their activations in the image.
Instance Segmentation. There are two representative frameworks for instance segmentation: ‘top-down’ and ‘bottom-up’ approaches. ‘Top-down’ approaches He et al. (2017); Dai et al. (2016); Li et al. (2017); Bolya et al. (2019) first detect accurate bounding boxes and generate a mask for each box. Mask R-CNN He et al. (2017) simplifies this pipeline by directly adding an FCN Long et al. (2015) in Faster R-CNN Ren et al. (2015). Extensions of this framework add a mask scoring branch Huang et al. (2019) or adopt a cascade structure Cai and Vasconcelos (2018); Chen et al. (2019a). ‘Bottom-up’ methods Bai and Urtasun (2017); Newell et al. (2017); Neven et al. (2019); Kirillov et al. (2017) first perform semantic segmentation and then group pixels into different instances. These methods usually require a grouping process, and their performance often appears inferior to ‘top-down’ approaches in popular benchmarks Lin et al. (2014); Cordts et al. (2016). Unlike all these works, K-Net performs segmentation and instance separation simultaneously by constraining each kernel to predict one mask at a time for one object. Therefore, K-Net needs neither bounding box detection nor a grouping process. It focuses on refining kernels rather than refining bounding boxes, unlike previous cascade methods Cai and Vasconcelos (2018); Chen et al. (2019a).
Recent attempts Xie et al. (2020); Chen et al. (2019c); Wang et al. (2020a, b) perform instance segmentation in one stage without involving detection or embedding generation. These methods apply dense mask prediction using dense sliding windows Chen et al. (2019c) or dense grids Wang et al. (2020a). Some studies explore polar representation Xie et al. (2020), contour Peng et al. (2020), and explicit shape representation Xu et al. (2019) of instance masks. These methods all rely on NMS to eliminate duplicated instance masks, which hinders end-to-end training. The heuristic process is also unfavorable for real-time applications. Instance kernels in K-Net are trained in an end-to-end manner with bipartite matching and set prediction loss; thus, our method does not need NMS.
Panoptic Segmentation. Panoptic segmentation Kirillov et al. (2019b) combines instance and semantic segmentation to provide a richer understanding of the scene. Different strategies have been proposed to cope with the instance segmentation task. Mainstream frameworks add a semantic segmentation branch Kirillov et al. (2019a); Xiong et al. (2019); Li et al. (2020); Wang et al. (2020b) on an instance segmentation framework or adopt different pixel grouping strategies Cheng et al. (2020); Yang et al. (2019) based on a semantic segmentation method. Recently, DETR Carion et al. (2020) tries to simplify the framework with a transformer Vaswani et al. (2017) but needs to predict boxes around both stuff and things classes in training for assigning learning targets. These methods need either object detection or embedding generation to separate instances, and thus do not reconcile instance and semantic segmentation in a unified framework. By contrast, K-Net partitions an image into semantic regions by semantic kernels and object instances by instance kernels through a unified perspective of kernels.
Dynamic Kernels. Convolution kernels are usually static, i.e., agnostic to the inputs, and thus have limited representation ability. Previous works Dai et al. (2017); Zhu et al. (2019); Jaderberg et al. (2015); Jia et al. (2016); Gao et al. (2020) explore different kinds of dynamic kernels to improve the flexibility and performance of models. Some semantic segmentation methods apply dynamic kernels to improve the model representation with enlarged receptive fields Wu et al. (2018) or multi-scale contexts He et al. (2019a). Differently, K-Net uses dynamic kernels to improve the discriminative capability of the segmentation kernels more so than the input features of kernels.
Recent studies apply dynamic kernels to generate instance Tian et al. (2020); Wang et al. (2020b) or panoptic Li et al. (2021) segmentation predictions directly. Because these methods generate kernels from dense feature maps, enumerate kernels of each position, and filter out kernels of background regions, they either still rely on NMS Tian et al. (2020); Wang et al. (2020b) or need extra kernel fusion Li et al. (2021) to eliminate kernels or masks of duplicated objects. Instead of being generated from dense grids, the kernels in K-Net are a set of learnable parameters updated by their corresponding contents in the image. K-Net does not need to handle duplicated kernels because its kernels learn to focus on different regions of the image in training, constrained by the bipartite matching strategy that builds a one-to-one mapping between the kernels and instances.
We consider various segmentation tasks through a unified perspective of kernels. The proposed K-Net uses a set of kernels to assign each pixel to either a potential instance or a semantic class (Sec. 3.1). To enhance the discriminative capability of kernels, we contribute a way to update the static kernels by the contents in their partitioned pixel groups (Sec. 3.2). We adopt the bipartite matching strategy to train instance kernels in an end-to-end manner (Sec. 3.3). K-Net can be applied seamlessly to semantic, instance, and panoptic segmentation as described in Sec. 3.4.
Despite the different definitions of a ‘meaningful group’, all segmentation tasks essentially assign each pixel to one of the predefined meaningful groups Szeliski (2010). As the number of groups in an image is typically assumed finite, we can set the maximum group number of a segmentation task as $N$. For example, there are $N$ pre-defined semantic classes for semantic segmentation or at most $N$ objects in an image for instance segmentation. For panoptic segmentation, $N$ is the total number of stuff classes and objects in an image. Therefore, we can use $N$ kernels to partition an image into $N$ groups, where each kernel is responsible for finding the pixels belonging to its corresponding group. Specifically, given an input feature map $F \in \mathbb{R}^{B \times C \times H \times W}$ of $B$ images, produced by a deep neural network, we only need $N$ kernels $K \in \mathbb{R}^{N \times C}$ to perform convolution with $F$ to obtain the corresponding segmentation prediction $M \in \mathbb{R}^{B \times N \times H \times W}$ as
$$M = K * F,$$
where $C$, $H$, and $W$ are the number of channels, height, and width of the feature map, respectively.
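The formulation above can be illustrated with a minimal sketch for a single image. It assumes that each kernel holds one weight per channel and is applied as a 1×1 convolution (a per-pixel dot product), and that each pixel is then assigned to the kernel with the strongest response. The function names `predict_masks` and `assign_pixels` and the toy values are hypothetical and chosen only for illustration.

```python
def predict_masks(features, kernels):
    """Apply each kernel as a 1x1 convolution over the feature map.

    features: nested list with shape [C][H][W]
    kernels:  nested list with shape [N][C]
    returns:  nested list with shape [N][H][W] of mask responses
    """
    num_channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    masks = []
    for kernel in kernels:
        mask = [[sum(kernel[c] * features[c][u][v] for c in range(num_channels))
                 for v in range(width)]
                for u in range(height)]
        masks.append(mask)
    return masks


def assign_pixels(masks):
    """Assign every pixel to the kernel with the highest response."""
    height, width = len(masks[0]), len(masks[0][0])
    return [[max(range(len(masks)), key=lambda n: masks[n][u][v])
             for v in range(width)]
            for u in range(height)]


# Toy input: 2 channels, a 2x2 feature map, and 2 kernels.
features = [[[1.0, 0.0], [1.0, 0.0]],   # channel 0 is active in the left column
            [[0.0, 1.0], [0.0, 1.0]]]   # channel 1 is active in the right column
kernels = [[1.0, -1.0],                 # kernel 0 responds to channel 0
           [-1.0, 1.0]]                 # kernel 1 responds to channel 1
masks = predict_masks(features, kernels)
groups = assign_pixels(masks)           # -> [[0, 1], [0, 1]]
```

Here the two kernels split the image into a left and a right group, which is the sense in which a set of kernels partitions the pixels.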
This formulation has already dominated semantic segmentation for years Long et al. (2015); Zhao et al. (2017); Chen et al. (2017). In semantic segmentation, each kernel is responsible for finding all pixels of a similar class across images, whereas in instance segmentation, each pixel group corresponds to an object. However, previous methods separate instances by extra steps He et al. (2017); Newell et al. (2017); Kirillov et al. (2017) instead of by kernels.
This paper is the first study that explores whether the notion of kernels in semantic segmentation is equally applicable to instance segmentation, and more generally panoptic segmentation. To separate instances by kernels, each kernel in K-Net segments at most one object in an image (Fig. 1-(b)). In this way, K-Net distinguishes instances and performs segmentation simultaneously, achieving instance segmentation in one pass without extra steps. For simplicity, we call these kernels semantic and instance kernels in this paper for semantic and instance segmentation, respectively. A simple combination of instance kernels and semantic kernels can naturally perform panoptic segmentation that assigns a pixel to either an instance ID or a class of stuff (Fig. 1-(c)).
3.2 Group-Aware Kernels
Despite the simplicity of K-Net, separating instances directly by kernels is non-trivial because instance kernels need to discriminate objects that vary in scale and appearance within and across images. Without a common and explicit characteristic like semantic categories, the instance kernels need stronger discriminative ability than static kernels.
To overcome this challenge, we contribute an approach to make the kernels conditional on their corresponding pixel groups, through a kernel update head shown in Fig. 2. The kernel update head contains three key steps: group feature assembling, adaptive kernel update, and kernel interaction. Firstly, the group feature $F^K$ for each pixel group is assembled using the mask prediction $M_{i-1}$. As it is the content of each individual group that distinguishes the groups from each other, $F^K$ is used to update their corresponding kernels $K_{i-1}$ adaptively. After that, the kernels interact with each other to comprehensively model the image context. Finally, the obtained group-aware kernels $K_i$ perform convolution over the feature map $F$ to obtain a more accurate mask prediction $M_i$. This process can be conducted iteratively because a finer partition usually reduces the noise in group features, which results in more discriminative kernels. This process is formulated as
$$K_i, M_i = f_i(M_{i-1}, K_{i-1}, F),$$
where $f_i$ denotes the $i$-th kernel update head.
Notably, the kernel update head and the iterative refinement are universal as they do not rely on the characteristics of kernels. Thus, they can enhance not only instance kernels but also semantic kernels. We detail the three steps as follows.
Group Feature Assembling. The kernel update head first assembles the features of each group, which will be adopted later to make the kernels group-aware. As the mask of each kernel in $M_{i-1}$ essentially defines whether or not a pixel belongs to the kernel’s related group, we can assemble the feature $F^K$ for $K_{i-1}$ by multiplying the feature map $F$ with $M_{i-1}$ as
$$F^K = \sum_{u=1}^{H} \sum_{v=1}^{W} M_{i-1}(u, v) \cdot F(u, v), \quad F^K \in \mathbb{R}^{B \times N \times C},$$
where $B$ is the batch size, $N$ is the number of kernels, and $C$ is the number of channels.
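As a small illustration of group feature assembling for a single image, the sketch below sums, for every group, the feature vectors of the pixels selected by that group's mask. The helper name `assemble_group_features` is hypothetical, and the masks are assumed to be binary here; soft masks would simply weight each pixel's contribution.

```python
def assemble_group_features(masks, features):
    """Mask-weighted sum of features over all pixels, one vector per group.

    masks:    nested list with shape [N][H][W]
    features: nested list with shape [C][H][W]
    returns:  nested list with shape [N][C]
    """
    num_channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [[sum(mask[u][v] * features[c][u][v]
                 for u in range(height) for v in range(width))
             for c in range(num_channels)]
            for mask in masks]


features = [[[1.0, 2.0], [3.0, 4.0]],   # channel 0
            [[0.5, 0.5], [0.5, 0.5]]]   # channel 1
masks = [[[1.0, 0.0], [1.0, 0.0]],      # group 0 covers the left column
         [[0.0, 1.0], [0.0, 1.0]]]      # group 1 covers the right column
group_features = assemble_group_features(masks, features)
# group 0: channel 0 = 1 + 3, channel 1 = 0.5 + 0.5
# group 1: channel 0 = 2 + 4, channel 1 = 0.5 + 0.5
```

Each row of `group_features` summarizes only the pixels currently assigned to one kernel, which is what lets the kernel be updated by the content of its own group.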
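Adaptive Feature Update. The kernel update head then updates the kernels using the obtained $F^K$ to improve the representation ability of kernels. As the mask $M_{i-1}$ may not be accurate, which is commonly the case, the feature of each group may also contain noise introduced by pixels from other groups. To reduce the adverse effect of the noise in group features, we devise an adaptive kernel update strategy. Specifically, we first conduct element-wise multiplication between $F^K$ and $K_{i-1}$ as
$$F^G = \phi_1(F^K) \otimes \phi_2(K_{i-1}), \quad F^G \in \mathbb{R}^{B \times N \times C},$$
where $\phi_1$ and $\phi_2$ are linear transformations. Then the head learns two gates, $G^F$ and $G^K$, which adapt the contribution from $F^K$ and $K_{i-1}$ to the updated kernel $\tilde{K}$, respectively. The formulation is
$$G^K = \sigma(\psi_1(F^G)), \quad G^F = \sigma(\psi_2(F^G)), \quad \tilde{K} = G^F \otimes \psi_3(F^K) + G^K \otimes \psi_4(K_{i-1}),$$
where $\psi_n$, $n = 1, \dots, 4$, are different fully connected (FC) layers and $\sigma$ is the Sigmoid function. $\tilde{K}$ is then used in kernel interaction.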
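The gated update described above can be sketched for a single kernel as follows. This is one plausible reading of the mechanism, not the authors' implementation: every transformation is modeled as a bias-free linear map, and the parameter names (`phi1`, `phi2`, `psi1` to `psi4`) are labels assumed for illustration. In the toy call, the gate weights are zero, so both gates equal 0.5 and the result is the average of the group feature and the old kernel.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, vector):
    """Apply a bias-free linear map given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def adaptive_update(group_feature, kernel, params):
    """Blend a group feature and the previous kernel with two learned gates."""
    # Element-wise product of the two transformed inputs.
    fused = [a * b for a, b in zip(linear(params["phi1"], group_feature),
                                   linear(params["phi2"], kernel))]
    # One gate per input, both computed from the fused representation.
    gate_kernel = [sigmoid(x) for x in linear(params["psi1"], fused)]
    gate_feature = [sigmoid(x) for x in linear(params["psi2"], fused)]
    feature_term = linear(params["psi3"], group_feature)
    kernel_term = linear(params["psi4"], kernel)
    return [gf * f + gk * k
            for gf, f, gk, k in zip(gate_feature, feature_term,
                                    gate_kernel, kernel_term)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {"phi1": identity, "phi2": identity,
          "psi1": zeros, "psi2": zeros,      # zero weights give gates of 0.5
          "psi3": identity, "psi4": identity}
updated = adaptive_update([2.0, 4.0], [1.0, -1.0], params)   # -> [1.5, 1.5]
```

With trained gate weights, a noisy group feature can be down-weighted in favor of the previous kernel, which is the purpose of the gating.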
Kernel Interaction. Interaction among kernels is important to inform each kernel with contextual information from other groups. Such information allows the kernel to implicitly model and exploit the relationship between groups of an image. To this end, we add a kernel interaction process to obtain the new kernels $K_i$ given the updated kernels $\tilde{K}$. Here we simply adopt Multi-Head Attention Vaswani et al. (2017) followed by a Feed-Forward Neural Network, which has been proven effective in previous works Carion et al. (2020); Vaswani et al. (2017). The output $K_i$ of kernel interaction is then used to generate a new mask prediction $M_i$ through convolution with the feature map $F$. $K_i$ will also be used to predict classification scores in instance and panoptic segmentation.
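To make the idea of kernel interaction concrete, the sketch below lets every kernel attend to all kernels with scaled dot-product self-attention. It is a deliberately reduced stand-in: it uses a single head, omits the learned query, key, and value projections, and leaves out the feed-forward network, all of which the text above includes. The function name `self_attention` is hypothetical.

```python
import math


def self_attention(kernels):
    """Single-head scaled dot-product self-attention over kernel vectors.

    kernels: list of N vectors, each of length D
    returns: list of N vectors, each a weighted average of all kernels
    """
    dim = len(kernels[0])
    outputs = []
    for query in kernels:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in kernels]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * value[d] for w, value in zip(weights, kernels))
                        for d in range(dim)])
    return outputs


kernels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(kernels)
```

Each output vector mixes information from every kernel, so a kernel can take the other groups in the image into account before predicting its next mask.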
3.3 Training Instance Kernels
While each semantic kernel can be assigned to a constant semantic class, there is no explicit rule for assigning a varying number of targets to instance kernels. In this work, we adopt the bipartite matching strategy and set prediction loss Stewart et al. (2016); Carion et al. (2020) to train instance kernels in an end-to-end manner. Different from previous works Stewart et al. (2016); Carion et al. (2020) that rely on boxes, the learning of instance kernels is purely mask-driven because the inference of K-Net is naturally box-free.
The loss function for instance kernels is written as
$$L_K = \lambda_{cls} L_{cls} + \lambda_{ce} L_{ce} + \lambda_{dice} L_{dice},$$
where $L_{cls}$ is Focal loss Lin et al. (2017b) for classification, $L_{ce}$ and $L_{dice}$ are CrossEntropy (CE) loss and Dice loss Milletari et al. (2016) for segmentation, respectively, and $\lambda_{cls}$, $\lambda_{ce}$, and $\lambda_{dice}$ are the corresponding loss weights. Given that each instance only occupies a small region in an image, CE loss is insufficient to handle the highly imbalanced learning targets of masks. Therefore, we apply Dice loss Milletari et al. (2016) to handle this issue following previous works Wang et al. (2020a, b); Tian et al. (2020).
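The imbalance argument can be checked with a toy example. The sketch below compares a per-pixel binary cross-entropy with a Dice loss on a mask where only 1 of 100 pixels is foreground. The Dice form used here (squared terms in the denominator) and the smoothing constant `eps` are assumptions made for the illustration, and the function names are hypothetical.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-6):
    """Mean per-pixel binary cross-entropy on flattened masks."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)      # avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)


def dice_loss(pred, target, eps=1e-6):
    """One minus the Dice coefficient between flattened masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


# A small object: 1 foreground pixel out of 100.
target = [1.0] + [0.0] * 99
# A prediction that calls everything background and misses the object.
all_background = [0.01] * 100

ce_value = binary_cross_entropy(all_background, target)   # small, about 0.06
dice_value = dice_loss(all_background, target)            # large, about 0.98
```

The cross-entropy stays low even though the object is missed entirely, while the Dice loss stays close to its maximum, which is why an overlap-based term helps with small instances.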
Mask-based Hungarian Assignment. We adopt the Hungarian assignment strategy used in Stewart et al. (2016); Carion et al. (2020) for target assignment to train K-Net in an end-to-end manner. It builds a one-to-one mapping between the predicted instance masks and the ground-truth (GT) instances based on the matching costs. The matching cost is calculated between each mask and GT pair in a similar manner to the training loss.
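A one-to-one, mask-based assignment can be sketched as follows. For brevity the example searches all assignments by brute force, which is only feasible for tiny inputs; the Hungarian algorithm finds the same minimum-cost matching efficiently. The cost here uses a Dice term alone, which is a simplifying assumption, and the names `dice_cost` and `match_masks` are hypothetical.

```python
from itertools import permutations


def dice_cost(pred, target, eps=1e-6):
    """Matching cost between a flattened predicted mask and a GT mask."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(p * p for p in pred) + sum(t * t for t in target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def match_masks(pred_masks, gt_masks):
    """Return (pred_index, gt_index) pairs with minimum total cost.

    Assumes there are at least as many predictions as GT instances.
    """
    cost = [[dice_cost(p, g) for g in gt_masks] for p in pred_masks]
    best_pairs, best_total = None, float("inf")
    # Try every way of assigning distinct predictions to the GT instances.
    for chosen in permutations(range(len(pred_masks)), len(gt_masks)):
        total = sum(cost[p][g] for g, p in enumerate(chosen))
        if total < best_total:
            best_total = total
            best_pairs = [(p, g) for g, p in enumerate(chosen)]
    return best_pairs


pred_masks = [[0.0, 0.0, 0.9, 0.8],   # resembles GT instance 1
              [0.1, 0.1, 0.1, 0.1],   # resembles neither instance
              [0.9, 0.7, 0.0, 0.1]]   # resembles GT instance 0
gt_masks = [[1.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 1.0]]
pairs = match_masks(pred_masks, gt_masks)   # -> [(2, 0), (0, 1)]
```

Because each GT instance is matched to exactly one prediction, the number of instances can vary from image to image without any box-based assignment rule.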
3.4 Applications to Various Segmentation Tasks
Panoptic Segmentation. For panoptic segmentation, the kernels are composed of instance kernels $K_0^{ins}$ and semantic kernels $K_0^{seg}$ as shown in Fig. 2. We adopt semantic FPN Kirillov et al. (2019a) for producing the high-resolution feature map $F$, except that we add the positional encoding used in Vaswani et al. (2017); Carion et al. (2020); Zhu et al. (2020) to enhance the positional information. As semantic segmentation mainly relies on semantic information for per-pixel classification, while instance segmentation prefers accurate localization information to separate instances, we use two separate branches to generate the features $F^{ins}$ and $F^{seg}$ to perform convolution with $K_0^{ins}$ and $K_0^{seg}$ for generating instance and semantic masks $M_0^{ins}$ and $M_0^{seg}$, respectively. This design is consistent with previous practices Kirillov et al. (2019a); Wang et al. (2020b).
We then construct $M_0$, $K_0$, and $F$ as the inputs of the kernel update head to dynamically update the kernels and refine the panoptic mask prediction. Because ‘things’ are already separated by instance masks in $M_0^{ins}$, while $M_0^{seg}$ contains the semantic masks of both ‘things’ and ‘stuff’, we select $M_0^{st}$, the masks of stuff categories from $M_0^{seg}$, and directly concatenate it with $M_0^{ins}$ to form the panoptic mask prediction $M_0$. For a similar reason, we only select and concatenate the kernels of stuff classes in $K_0^{seg}$ with $K_0^{ins}$ to form the panoptic kernels $K_0$. To exploit the complementary semantic information in $F^{seg}$ and localization information in $F^{ins}$, we add them together to obtain $F$ as the input feature map of the kernel update head. With the constructed $M_0$, $K_0$, and $F$, the kernel update head can produce group-aware kernels iteratively for $S$ rounds and obtain the refined mask prediction $M_S$. We follow Panoptic FPN Kirillov et al. (2019a) to merge masks of things and stuff in $M_S$ based on their classification scores for the final panoptic segmentation results.
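The construction of the panoptic inputs can be sketched with plain lists. The example keeps all instance entries, selects only the stuff entries from the semantic branch, concatenates the two, and sums the two feature branches element-wise. The helper names, the placeholder strings, and the choice of which class indices count as stuff are all hypothetical.

```python
def build_panoptic_inputs(inst_masks, inst_kernels, sem_masks, sem_kernels, stuff_ids):
    """Concatenate instance entries with the stuff subset of the semantic entries."""
    stuff_masks = [sem_masks[i] for i in stuff_ids]
    stuff_kernels = [sem_kernels[i] for i in stuff_ids]
    return inst_masks + stuff_masks, inst_kernels + stuff_kernels


def add_features(inst_feature, sem_feature):
    """Element-wise sum of two feature vectors of equal length."""
    return [a + b for a, b in zip(inst_feature, sem_feature)]


# Toy setup: 3 instance kernels and 4 semantic classes, of which 2 are stuff.
inst_masks = ["inst_mask_0", "inst_mask_1", "inst_mask_2"]
inst_kernels = ["inst_kernel_0", "inst_kernel_1", "inst_kernel_2"]
sem_masks = ["mask_person", "mask_car", "mask_sky", "mask_road"]
sem_kernels = ["kernel_person", "kernel_car", "kernel_sky", "kernel_road"]
stuff_ids = [2, 3]   # assumed indices of the stuff classes (sky, road)

panoptic_masks, panoptic_kernels = build_panoptic_inputs(
    inst_masks, inst_kernels, sem_masks, sem_kernels, stuff_ids)
fused_feature = add_features([1.0, 2.0], [0.5, 0.5])
```

The thing classes of the semantic branch are dropped at this point because individual things are already covered by the instance kernels.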
Instance Segmentation. Within the same framework, we simply remove the concatenation of kernels and masks to perform instance segmentation. We do not remove the semantic segmentation branch as the semantic information is still complementary for instance segmentation. Note that in this case, the semantic segmentation branch does not use extra annotations. The ground truth of semantic segmentation is built by converting instance masks to their corresponding class labels.
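Building a semantic ground truth from instance annotations can be sketched as below: every pixel covered by an instance mask receives that instance's class label. How uncovered pixels and overlapping instances are handled is not stated in the text, so the `ignore_label` value of 255 and the rule that later instances overwrite earlier ones are assumptions of this sketch.

```python
def instances_to_semantic(instance_masks, instance_labels, height, width,
                          ignore_label=255):
    """Paint each instance mask with its class label on a label map.

    instance_masks:  list of binary masks, each with shape [H][W]
    instance_labels: list of class labels, one per instance
    """
    # Pixels not covered by any instance keep the (assumed) ignore label.
    semantic = [[ignore_label] * width for _ in range(height)]
    for mask, label in zip(instance_masks, instance_labels):
        for u in range(height):
            for v in range(width):
                if mask[u][v]:
                    semantic[u][v] = label   # later instances overwrite earlier ones
    return semantic


instance_masks = [[[1, 1, 0], [0, 0, 0]],
                  [[0, 0, 0], [0, 1, 1]]]
instance_labels = [3, 7]
semantic_gt = instances_to_semantic(instance_masks, instance_labels, 2, 3)
# -> [[3, 3, 255], [255, 7, 7]]
```

Two instances of the same class would receive the same label here, which is exactly the information a semantic branch needs and why no extra annotation is required.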
Semantic Segmentation. As K-Net does not rely on specific architectures of model representation, K-Net can perform semantic segmentation by simply appending its kernel update head to any existing semantic segmentation method Zhao et al. (2017); Long et al. (2015); Chen et al. (2017); Xiao et al. (2018) that relies on semantic kernels.
Dataset and Metrics. For panoptic and instance segmentation, we perform experiments on the challenging COCO dataset Lin et al. (2014). All models are trained on the train2017 split and evaluated on the val2017 split. The panoptic segmentation results are evaluated by the PQ metric Kirillov et al. (2019b). We also report the performance of thing and stuff, denoted as PQ$^{Th}$ and PQ$^{St}$, respectively, for thorough evaluation. The instance segmentation results are evaluated by mask AP Lin et al. (2014). The AP for small, medium, and large objects are denoted as AP$_s$, AP$_m$, and AP$_l$, respectively. The AP at mask IoU thresholds 0.5 and 0.75 are also reported as AP$_{50}$ and AP$_{75}$, respectively. For semantic segmentation, we conduct experiments on the challenging ADE20K dataset Zhou et al. (2019) and report mIoU to evaluate the segmentation quality. All models are trained on the train split and evaluated on the validation split.
Implementation Details. For panoptic and instance segmentation, we implement K-Net with MMDetection Chen et al. (2019b). In the ablation study, the model is trained with a batch size of 16 for 12 epochs. The learning rate is 0.0001, and it is multiplied by 0.1 after 8 and 11 epochs, respectively. We use AdamW Loshchilov and Hutter (2019) with a weight decay of 0.05. For data augmentation in training, we adopt horizontal flip augmentation with a single scale. The long edge and short edge of images are resized to 1333 and 800, respectively, without changing the aspect ratio. When comparing with other frameworks, we use multi-scale training with a longer schedule (36 epochs) for fair comparisons Chen et al. (2019b). The short edge of images is randomly sampled from a range of scales, following He et al. (2019b).
For semantic segmentation, we implement K-Net with MMSegmentation Contributors (2020) and train it for 80000 iterations. As AdamW Loshchilov and Hutter (2019) empirically works better than SGD, we use AdamW with a weight decay of 0.0005 by default on both the baselines and K-Net for a fair comparison. The initial learning rate is 0.0001, and it is multiplied by 0.1 after 60000 and 72000 iterations, respectively. More details are provided in the appendix.
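The step schedule used for semantic segmentation can be written as a small function. The sketch assumes that the decay is multiplicative (each milestone multiplies the rate by 0.1) and that it takes effect from the milestone iteration onward; the function name `learning_rate` and the boundary convention are assumptions, not details given in the text.

```python
def learning_rate(iteration, base_lr=1e-4, milestones=(60000, 72000), gamma=0.1):
    """Step schedule: multiply the base rate by gamma at every passed milestone."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed


# Learning rate at the start, after the first decay, and after the second decay.
rates = [learning_rate(i) for i in (0, 60000, 79999)]
# approximately 1e-4, 1e-5, and 1e-6
```

The same function covers the epoch-based schedule for panoptic and instance segmentation by passing `milestones=(8, 11)` and counting epochs instead of iterations.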
Model Hyperparameters. In the ablation study, we adopt a ResNet-50 He et al. (2016) backbone with FPN Lin et al. (2017a). For panoptic and instance segmentation, we set the weight of Focal loss following previous methods Zhu et al. (2020), and choose the remaining loss weights empirically. For efficiency, the default number of instance kernels is 100. For semantic segmentation, $N$ equals the number of classes of the dataset, which is 150 in ADE20K and 133 in the COCO dataset. The number of rounds of iterative kernel update is set to three by default for all segmentation tasks.
4.1 Benchmark Results
Panoptic Segmentation. We first benchmark K-Net against other panoptic segmentation frameworks in Table 1. K-Net surpasses the previous state-of-the-art box-based method Li et al. (2020) and box/NMS-free method Li et al. (2021) by 1.7 and 1.5 PQ on the val split, respectively. On the test-dev split, K-Net with a ResNet-101-FPN backbone even obtains better results than UPSNet Xiong et al. (2019), which uses Deformable Convolution Network (DCN) Dai et al. (2017). K-Net equipped with DCN surpasses the previous method Li et al. (2020) by 1.1 PQ. Without bells and whistles, K-Net achieves new state-of-the-art single-model performance with Swin Transformer Liu et al. (2021) serving as the backbone. Note that only 100 instance kernels are used here for efficiency. K-Net could obtain higher performance with more instance kernels (Sec. 4.2), as well as with an extended training schedule and the aggressive data augmentation used in previous work Carion et al. (2020).
Instance Segmentation. We compare K-Net with other instance segmentation frameworks He et al. (2017); Tian et al. (2020); Chen et al. (2019c) in Table 2. More details are in the appendix. As the only box-free and NMS-free method, K-Net achieves better performance and faster inference speed than Mask R-CNN He et al. (2017), SOLO Wang et al. (2020a), SOLOv2 Wang et al. (2020b), and CondInst Tian et al. (2020), as indicated by the higher AP and frames per second (FPS). We adopt 256 instance kernels (K-Net-N256 in the table) to compare with Cascade Mask R-CNN Cai and Vasconcelos (2018). The performance of K-Net-N256 is on par with Cascade Mask R-CNN Cai and Vasconcelos (2018) while its inference is 92.2% faster (19.8 vs. 10.3 FPS). On the COCO test-dev split, K-Net with a ResNet-101-FPN backbone obtains performance that is 0.9 AP better than Mask R-CNN He et al. (2017). It also surpasses the previous kernel-based approaches CondInst Tian et al. (2020) and SOLOv2 Wang et al. (2020b) by 1.2 AP and 0.6 AP, respectively. With a ResNet-101-FPN backbone, K-Net with 100 and 256 instance kernels surpasses Cascade Mask R-CNN in both accuracy and speed, by 0.2 AP and 6.7 FPS, and by 0.7 AP and 6 FPS, respectively.
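For reference, the quoted relative speed-up follows directly from the two frame rates reported for K-Net-N256 and Cascade Mask R-CNN:

```latex
\frac{19.8 - 10.3}{10.3} = \frac{9.5}{10.3} \approx 0.922 \quad (\text{i.e., } 92.2\% \text{ faster})
```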
Semantic Segmentation. We apply K-Net to existing frameworks Long et al. (2015); Zhao et al. (2017); Chen et al. (2017); Xiao et al. (2018) that rely on static semantic kernels in Table 3. K-Net consistently improves different frameworks by making their semantic kernels dynamic. Notably, K-Net significantly improves FCN (6.6 mIoU). This combination surpasses PSPNet and UperNet by 0.7 and 0.9 mIoU, respectively, and achieves performance comparable with DeepLab v3. Furthermore, the effectiveness of K-Net does not saturate with strong model representation, as it still brings significant improvement (1.4 mIoU) over UperNet with Swin Transformer Liu et al. (2021). The results suggest the versatility and effectiveness of K-Net for semantic segmentation. In Table 3, we further compare K-Net with other state-of-the-art methods Zheng et al. (2021); Chen et al. (2018b) with test-time augmentation on the validation set. With an input of 512×512, K-Net already achieves state-of-the-art performance. With a larger input of 640×640 during training and testing, following the previous method Liu et al. (2021), K-Net with UperNet and Swin Transformer achieves new state-of-the-art single-model performance, which is 0.8 mIoU higher than the previous one.
4.2 Ablation Study on Instance Segmentation
We conduct an ablation study on the COCO instance segmentation dataset to evaluate the effectiveness of K-Net in discriminating instances. The conclusion is also applicable to other segmentation tasks since the design of K-Net is universal to all segmentation tasks.
Head Architecture. We verify the components in the kernel update head. The results in Table 4 indicate that both adaptive kernel update and kernel interaction are necessary for high performance.
Positional Information. We study the necessity of positional information in Table 4. The results show that positional information is beneficial, and positional encoding Vaswani et al. (2017); Carion et al. (2020) works slightly better than coordinate convolution. The combination of the two components does not bring additional improvements. The results justify the use of just positional encoding in our framework.
Number of Stages. We compare different kernel update rounds in Table 4. The results show that FPS decreases as the update rounds grow while the performance saturates beyond three stages.
Number of Kernels. We further study the number of kernels in K-Net. The results in Table 4 reveal that 100 kernels are sufficient to achieve good performance. The observation is expected for the COCO dataset because most of the images in the dataset do not contain many objects (7.7 objects per image on average Lin et al. (2014)). K-Net consistently achieves better performance given more instance kernels since they improve the model’s capacity to cope with complicated images.
4.3 Visual Analysis
Overall Distribution of Kernels. We analyze the properties of instance kernels learned in K-Net by averaging the mask activations of the 100 instance kernels over the 5000 images in the val split. All the masks are resized to a similar resolution of 200 × 200 for the analysis. As shown in Fig. 3, the learned kernels are meaningful. Different kernels specialize in different regions of the image and objects of different sizes, while each kernel attends to objects of similar sizes at close locations across images.
Masks Refined through Kernel Update. We further analyze how the mask predictions of kernels are refined through the kernel update in Fig. 3. Here we take K-Net for panoptic segmentation to thoroughly analyze both semantic and instance masks. The masks produced by static kernels are incomplete, e.g., the masks of river and building are missed. After kernel update, the contents are thoroughly covered by the segmentation masks, though the boundaries of masks are still unsatisfactory. The boundaries are refined after more kernel updates. The classification confidences of instances also increase after kernel update. More results are in the appendix.
This paper explores instance kernels that can learn to separate instances during segmentation. Thus, extra components that previously assisted instance segmentation can be replaced by instance kernels, including bounding boxes, embedding generation, and hand-crafted post-processing like NMS, kernel fusion, and pixel grouping. Such an attempt, for the first time, allows different image segmentation tasks to be tackled through a unified framework. The framework, dubbed K-Net, first partitions an image into different groups by learned static kernels, then iteratively refines these kernels and their partition of the image by the features assembled from their partitioned groups. K-Net obtains new state-of-the-art single-model performance on panoptic and semantic segmentation benchmarks and surpasses the well-developed Cascade Mask R-CNN with the fastest inference speed among the recent instance segmentation frameworks. We hope that K-Net and the analysis will pave the way for future research on unified image segmentation frameworks.
Appendix A Experiments
A.1 Implementation Details
We implement K-Net based on MMSegmentation Contributors (2020) for experiments on semantic segmentation. We use AdamW Loshchilov and Hutter (2019) with a weight decay of 0.0005 and train the model for 80000 iterations by default. The initial learning rate is 0.0001, and it is multiplied by 0.1 after 60000 and 72000 iterations, respectively. This is different from the default training setting in MMSegmentation Contributors (2020), which uses SGD with momentum for 160000 iterations. However, our setting obtains performance similar to that of the default one. Therefore, we apply AdamW with 80000 iterations to all the models in Table 3 (a) of the main text for efficiency while keeping fair comparisons. For data augmentation, we follow the default settings in MMSegmentation Contributors (2020). The long edge and short edge of images are resized to 2048 and 512, respectively, without changing the aspect ratio (described as 512 × 512 in the main text for short). Then random crop, horizontal flip, and photometric distortion augmentations are adopted.
A.2 Benchmark Results
Instance Segmentation. In Table 2 of the main text, we compare both accuracy and inference speed of K-Net with previous methods. For fair comparison, we re-implement Mask R-CNN He et al. (2017) and Cascade Mask R-CNN Cai and Vasconcelos (2018) with the multi-scale 3× training schedule Chen et al. (2019b); Wu et al. (2019), and submit their predictions to the evaluation server (https://competitions.codalab.org/competitions/20796) to obtain their accuracies on the test-dev split. For SOLO Wang et al. (2020a), SOLOv2 Wang et al. (2020b), and CondInst Tian et al. (2020), we test and report the accuracies of the models released in their official implementation Tian et al. (2019), which are trained with the multi-scale 3× training schedule. This is because the papers Wang et al. (2020a, b) of SOLO and SOLOv2 only report the results of the multi-scale 6× schedule, and the AP$_s$, AP$_m$, and AP$_l$ of CondInst Tian et al. (2020) are calculated based on the areas of bounding boxes rather than instance masks due to an implementation bug. The performance of TensorMask Chen et al. (2019c) is reported from Table 3 of that paper. The results in Table 2 show that K-Net obtains better AP$_m$ and AP$_l$ but lower AP$_s$ than Cascade Mask R-CNN. We hypothesize this is because Cascade Mask R-CNN rescales the regions of small, medium, and large objects to a similar scale of 28 × 28, and predicts masks on that scale. On the contrary, K-Net predicts all the masks on a high-resolution feature map.
We use frames per second (FPS) to benchmark the inference speed of the models. Specifically, we benchmark SOLO Wang et al. (2020a), SOLOv2 Wang et al. (2020b), CondInst Tian et al. (2020), Mask R-CNN He et al. (2017), Cascade Mask R-CNN Cai and Vasconcelos (2018), and K-Net on an NVIDIA V100 GPU. We measure the pure inference speed of the model and exclude data loading time, because the latency of data loading depends on the storage system of the testing machine and can vary across environments. The reported FPS is the average over three runs, where each run measures the FPS of a model over 400 iterations Wu et al. (2019); Chen et al. (2019b). Note that the inference speed of these models may change with better implementations and specific optimizations. We therefore present them in Table 2 only to verify that K-Net is fast and effective.
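The measurement protocol above can be sketched as follows. This is a minimal illustration, not the benchmarking script used for Table 2: `measure_fps` and `average_fps` are hypothetical names, `model` stands for any callable, and `inputs` is assumed to be a list of samples already loaded into memory so that data loading is excluded from the timing.

```python
import time


def measure_fps(model, inputs, num_iters=400):
    """FPS of `model` over `num_iters` calls on inputs already held in memory."""
    start = time.perf_counter()
    for i in range(num_iters):
        model(inputs[i % len(inputs)])
        # On a GPU, one would also wait for the device to finish here
        # (e.g. torch.cuda.synchronize()) so the timer sees the real latency.
    elapsed = time.perf_counter() - start
    return num_iters / elapsed


def average_fps(model, inputs, num_runs=3, num_iters=400):
    """Average the FPS measured in `num_runs` independent runs."""
    runs = [measure_fps(model, inputs, num_iters) for _ in range(num_runs)]
    return sum(runs) / len(runs)
```

Timing only the forward calls on preloaded inputs keeps the result independent of the storage system, which is the reason given above for excluding data loading.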
We also compare the number of parameters of these models in Table A1. Although K-Net does not have the fewest parameters, it is more lightweight than Cascade Mask R-CNN, with approximately half the number of parameters (37.3 M vs. 77.1 M).
Semantic Segmentation. In Table 3 (b) of the main text, we compare K-Net on UperNet Xiao et al. (2018) using Swin Transformer Liu et al. (2021) with the previous state-of-the-art obtained by Swin Transformer Liu et al. (2021). We first directly test the model in the last row of Table 3 (a) of the main text (52.0 mIoU) with test-time augmentation and obtain 53.3 mIoU, which is on par with the current state-of-the-art result (53.5 mIoU). We then follow the setting in Swin Transformer Liu et al. (2021) to train the model at a larger scale, which resizes the long edge and short edge of images to 2048 and 640, respectively, during training and testing. The model finally obtains 54.3 mIoU on the validation set, a new state-of-the-art result on ADE20K.
A.3 Visual Analysis
Masks Refined through Kernel Update. We analyze how the mask predictions change before and after each round of kernel update, as shown in Figure A1. The static kernels have difficulty handling the boundaries between masks, and their mask predictions cannot cover the whole image. Through the kernel updates, the mask boundaries are gradually refined and the empty holes in big masks are eventually filled. Notably, the mask predictions after the second and the third rounds look very similar, which suggests that the discriminative capability of the kernels starts to saturate after the second round of kernel update. The visual analysis is consistent with the evaluation metrics of a similar model on the val split, where the static kernels before kernel update only achieve 33.0 PQ and the dynamic kernels after the first update obtain 41.0 PQ. The dynamic kernels after the second and the third rounds obtain 46.0 PQ and 46.3 PQ, respectively.
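The saturation can be read directly from the PQ values reported above. The short sketch below only restates that arithmetic; the variable names are illustrative.

```python
# PQ of the static kernels, then after kernel update rounds 1, 2, and 3.
pq_by_round = [33.0, 41.0, 46.0, 46.3]

# Gain contributed by each successive round of kernel update.
gains = [round(after - before, 1)
         for before, after in zip(pq_by_round, pq_by_round[1:])]
print(gains)  # [8.0, 5.0, 0.3]
```

The first two rounds account for 13.0 of the 13.3 PQ gained over the static kernels, while the third round adds only 0.3 PQ.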
Failure Cases. We also analyze the failure cases and find two typical failure modes of K-Net. First, for contents with very similar texture and appearance, K-Net sometimes has difficulty distinguishing them from each other, which results in inaccurate mask boundaries and misclassification of contents. Second, as shown in Figure A2, in crowded scenes it is also challenging for K-Net to recognize and segment all the instances given the limited number of instance kernels.
Appendix B Broader Impact
Simplicity and effectiveness are two significant properties pursued by computer vision algorithms. Our work pushes the boundary of segmentation algorithms in both aspects by providing a unified perspective that tackles semantic, instance, and panoptic segmentation tasks consistently. The work could also ease and accelerate model production and deployment in real-world applications, such as autonomous driving, robotics, and mobile phones. The higher accuracy of the model proposed in this work could also improve the safety of its related applications. However, due to limited resources, we do not evaluate the robustness of the proposed method against corrupted images and adversarial attacks. Therefore, the safety of applications using this work may not be guaranteed. To mitigate this, we plan to analyze and improve the robustness of the models in future research.
References
- Deep watershed transform for instance segmentation. In CVPR, Cited by: §2.
- YOLACT: Real-time instance segmentation. In ICCV, Cited by: §2.
- Cascade R-CNN: Delving into high quality object detection. In CVPR, Cited by: §A.2, §A.2, §1, §2, §4.1, Table 2.
- End-to-end object detection with transformers. In ECCV, Cited by: §1, §2, §3.2, §3.3, §3.3, §3.4, §4.1, §4.2, Table 1.
- Hybrid task cascade for instance segmentation. In CVPR, Cited by: §2.
- MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §A.2, §A.2, §4.
- DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI. Cited by: §2.
- Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §2, §3.1, §3.4, §4.1, Table 3.
- Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, Cited by: §2, §4.1, Table 3.
- TensorMask: a foundation for dense object segmentation. Cited by: §A.2, §2, §4.1, Table 2.
- Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, Cited by: §2, Table 1.
- MMSegmentation: openmmlab semantic segmentation toolbox and benchmark. Note: https://github.com/open-mmlab/mmsegmentation Cited by: §A.1, §4.
- . In CVPR, Cited by: §2.
- Instance-aware semantic segmentation via multi-task network cascades. CVPR. Cited by: §2.
- Deformable convolutional networks. In ICCV, Cited by: §2, §4.1.
- An image is worth 16x16 words: transformers for image recognition at scale. In ICLR, Cited by: §2, Table 3.
- Deformable Kernels: Adapting effective receptive fields for object deformation. In ICLR, Cited by: §2.
- Dynamic multi-scale filters for semantic segmentation. In ICCV, Cited by: §2.
- Rethinking ImageNet pre-training. In ICCV, Cited by: §4.
- Mask R-CNN. In ICCV, Cited by: §A.2, §A.2, §1, §2, §3.1, §4.1, Table 2.
- Deep residual learning for image recognition. In CVPR, Cited by: §4.
- Mask Scoring R-CNN. In CVPR, Cited by: §2.
- Spatial transformer networks. In NeurIPS, Cited by: §2.
- Dynamic filter networks. In NeurIPS, Cited by: §2.
- Panoptic feature pyramid networks. In CVPR, Cited by: §2, §3.4, §3.4, Table 1.
- Panoptic segmentation. In CVPR, Cited by: §1, §2, §4.
- InstanceCut: from edges to instances with multicut. In CVPR, Cited by: §2, §3.1.
- Unifying training and inference for panoptic segmentation. In CVPR, Cited by: §2, §4.1, Table 1.
- Fully convolutional networks for panoptic segmentation. In CVPR, Cited by: §1, §2, §4.1, Table 1.
- Fully convolutional instance-aware semantic segmentation. CVPR. Cited by: §2.
- Feature pyramid networks for object detection. In CVPR, Cited by: §4.
- Focal loss for dense object detection. In ICCV, Cited by: §1, §3.3.
- Microsoft COCO: common objects in context. In ECCV, Cited by: §1, §2, §4.2, §4.
- Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030. Cited by: §A.2, §4.1, §4.1, Table 1.
- Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §1, §2, §2, §3.1, §3.4, §4.1, Table 3.
- Decoupled weight decay regularization. In ICLR, Cited by: §A.1, §4, §4.
- V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, Cited by: §3.3.
- Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In CVPR, Cited by: §2.
- Associative Embedding: End-to-end learning for joint detection and grouping. In NeurIPS, Cited by: §1, §2, §3.1.
- Deep snake for real-time instance segmentation. In CVPR, Cited by: §2.
- Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS, Cited by: §1, §2.
- End-to-end people detection in crowded scenes. In CVPR, Cited by: §1, §3.3, §3.3.
- Computer vision: algorithms and applications. Springer Science & Business Media. Cited by: §1, §3.1.
- AdelaiDet: a toolbox for instance-level recognition tasks. Note: https://git.io/adelaidet Cited by: §A.2.
- Conditional convolutions for instance segmentation. In ECCV, Cited by: §A.2, §A.2, §1, §2, §3.3, §4.1, Table 2.
- Attention is all you need. In NeurIPS, Cited by: §2, §3.2, §3.4, §4.2.
- SOLO: Segmenting objects by locations. In ECCV, Cited by: §A.2, §A.2, §2, §3.3, §4.1, Table 2.
- SOLOv2: Dynamic and fast instance segmentation. NeurIPS. Cited by: §A.2, §A.2, §1, §2, §2, §2, §3.3, §3.4, §4.1, Table 1, Table 2.
- Dynamic filtering with large sampling field for convnets. In ECCV, Cited by: §2.
- Detectron2. Note: https://github.com/facebookresearch/detectron2 Cited by: §A.2, §A.2.
- Unified perceptual parsing for scene understanding. In ECCV, Cited by: §A.2, §2, §3.4, §4.1, Table 3.
- PolarMask: Single shot instance segmentation with polar representation. In CVPR, Cited by: §2.
- UPSNet: A unified panoptic segmentation network. In CVPR, Cited by: §2, §4.1, Table 1.
- Explicit shape encoding for real-time instance segmentation. In ICCV, Cited by: §2.
- DeeperLab: Single-shot image parser. CoRR abs/1902.05093. Cited by: §2.
- Disentangled non-local neural networks. In ECCV, Cited by: §2, Table 3.
- Object-contextual representations for semantic segmentation. In ECCV, Cited by: §2, Table 3.
- Context encoding for semantic segmentation. In CVPR, Cited by: §2.
- ResNeSt: split-attention networks. arXiv preprint arXiv:2004.08955. Cited by: Table 3.
- Pyramid scene parsing network. In CVPR, Cited by: §2, §3.1, §3.4, §4.1, Table 3.
- PSANet: Point-wise spatial attention network for scene parsing. In ECCV, Cited by: §2, Table 3.
- Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, Cited by: §2, §4.1, Table 3.
- Semantic understanding of scenes through the ADE20K dataset. IJCV. Cited by: §1, §4.
- Deformable ConvNets V2: More deformable, better results. In CVPR, Cited by: §2.
- Deformable DETR: deformable transformers for end-to-end object detection. CoRR abs/2010.04159. Cited by: §3.4, §4.