Panoptic SegFormer

by Zhiqi Li, et al.

We present Panoptic SegFormer, a general framework for end-to-end panoptic segmentation with Transformers. The proposed method extends Deformable DETR with a unified mask prediction workflow for both things and stuff, making the panoptic segmentation pipeline concise and effective. With a ResNet-50 backbone, our method achieves 50.0% PQ on the COCO test-dev split, surpassing previous state-of-the-art methods by significant margins without bells and whistles. Using the more powerful PVTv2-B5 backbone, Panoptic SegFormer achieves a new record of 54.1% PQ and 54.4% PQ on the COCO val and test-dev splits, respectively, with single-scale input.



1 Introduction

Semantic segmentation and instance segmentation are two important and correlated vision problems. Their underlying connections recently motivated panoptic segmentation as a unification of both tasks [16]. In panoptic segmentation, image contents are divided into two types: things and stuff. Things are countable instances (e.g., person, car, and bicycle) and each instance has a unique id to distinguish it from the other instances. Stuff refers to the amorphous and uncountable regions (e.g., sky, grassland, and snow) and has no instance id [16].

The differences between things and stuff also lead to different ways of handling their predictions. A number of works simply decompose panoptic segmentation into an instance segmentation task for things and a semantic segmentation task for stuff [16, 15, 26, 39, 25]. However, such a separated strategy tends to increase model complexity and produce undesired artifacts. Several works further consider bottom-up (proposal-free) instance segmentation approaches but still maintain similar separate strategies [41, 12, 2, 7, 33]. Some recent methods try to simplify the panoptic segmentation pipeline by processing things and stuff with a unified framework. For example, several works [30, 38, 19, 42] achieve this with fully convolutional frameworks. These frameworks share a similar “top-down meets bottom-up” two-branch design, where a kernel branch encodes object/region information and is dynamically convolved with an image-level feature branch to generate the object/region masks.

Method | PQ (%) | #Param (M)
DETR-R50 [3] | 43.5 | 42.8
Max-Deeplab-S [32] | 48.4 | 62.0
Max-Deeplab-L [32] | 51.1 | 451.0
MaskFormer-T [8] | 47.4 | 42.0
MaskFormer-B [8] | 51.1 | 102.0
MaskFormer-L [8] | 52.7 | 212.0
Panoptic SegFormer-B0 | 49.6 | 22.2
Panoptic SegFormer-B2 | 52.6 | 41.6
Panoptic SegFormer-B5 | 54.1 | 100.9
Figure 1: Comparison with prior panoptic segmentation methods on the COCO val2017 split. Under comparable numbers of parameters, Panoptic SegFormer models outperform their counterparts. Panoptic SegFormer (PVTv2-B5) achieves a new state-of-the-art 54.1% PQ, outperforming the previous best method, MaskFormer, by 1.4% PQ with significantly fewer parameters.

Recently, Vision Transformers have been widely introduced to instance localization and recognition tasks [3, 43, 36, 23]. Vision Transformers generally divide an input image into crops and encode them as tokens. For object detection problems, both DETR [3] and Deformable DETR [43] represent the object proposals with a set of learnable queries which are used to predict bounding boxes and are dynamically matched with object ground truths via a bipartite graph matching loss. The role of query features is similar to RoI features in conventional detection architectures, thus inspiring several methods [3, 8, 32] with two-branch designs similar to Panoptic FCN [19].

In this work, we propose Panoptic SegFormer, a concise and effective framework for end-to-end panoptic segmentation with Vision Transformers. Specifically, Panoptic SegFormer contains three key designs:


  • A query set to represent things and stuff uniformly, where the stuff classes are considered a special type of things with a single instance id;

  • A location decoder which focuses on leveraging the location information of things and stuff to improve the segmentation quality;

  • A mask-wise post-processing strategy to equally merge the segmentation results of things and stuff.

Benefiting from these three designs, Panoptic SegFormer achieves state-of-the-art panoptic segmentation performance with high efficiency.

To verify our framework, we conduct extensive experiments on the COCO dataset [22]. As shown in Figure 1, our smallest model, Panoptic SegFormer (PVTv2-B0), achieves 49.6% PQ on the COCO val2017 split with only 22.2M parameters, surpassing prior arts such as MaskFormer [8] and Max-Deeplab [32], whose parameter sizes are two and three times larger. Panoptic SegFormer (PVTv2-B5) further achieves a state-of-the-art PQ of 54.1%, which is 3.0% PQ higher than Max-Deeplab (51.1% PQ) and 1.4% PQ higher than MaskFormer (52.7% PQ), while our method still enjoys significantly fewer parameters. It is worth mentioning that Panoptic SegFormer achieves 54.4% PQ on COCO test-dev with single-scale input, outperforming competition methods such as Innovation [4], which uses many tricks including model ensembling and multi-scale testing. Currently, Panoptic SegFormer (PVTv2-B5) holds the 1st place on the COCO Panoptic Segmentation leaderboard.

Figure 2: Overview of Panoptic SegFormer, which is composed of a backbone, an encoder, and a decoder. The backbone produces multi-scale features and the encoder refines them. The inputs of the decoder are the queries and the multi-scale features. The decoder consists of two sub-decoders: the location decoder, which learns reference points for the queries, and the mask decoder, which predicts the final categories and masks. Details of the decoder are introduced below. We use a mask-wise merging method instead of the commonly used pixel-wise argmax method to perform inference.

2 Related Work

Panoptic Segmentation.

The panoptic segmentation literature mainly treats this problem as a joint task of instance segmentation and semantic segmentation, where things and stuff are handled separately. Kirillov et al. [16] proposed the concept and benchmark of panoptic segmentation together with a baseline that directly combines the outputs of individual instance segmentation and semantic segmentation models. Since then, models such as Panoptic FPN [15], UPSNet [39], and AUNet [18] have improved the accuracy and reduced the computational overhead by combining instance segmentation and semantic segmentation into a single model. However, these methods still approximate the target task by solving surrogate sub-tasks, thereby introducing undesired model complexity and sub-optimal performance.

Recently, efforts have been made toward unified frameworks for panoptic segmentation. Li et al. [19] proposed Panoptic FCN, where the panoptic segmentation pipeline is simplified with a “top-down meets bottom-up” two-branch design similar to CondInst [30]. In their work, things and stuff are jointly modeled by an object/region-level kernel branch and an image-level feature branch. Several recent works represent things and stuff as queries and perform end-to-end panoptic segmentation via transformers. DETR [3] predicts the bounding boxes of things and stuff and combines the attention maps of the transformer decoder with the feature maps of ResNet [14] to perform panoptic segmentation. Max-Deeplab [32] directly predicts object categories and masks through a dual-path transformer, regardless of whether the category is things or stuff. On top of DETR, MaskFormer [8] uses an additional pixel decoder to refine high-resolution spatial features and generates the masks by multiplying queries with features from the pixel decoder. Due to the computational complexity of multi-head attention [31], both DETR and MaskFormer use feature maps with limited spatial resolution for panoptic segmentation, which hurts performance and requires combining additional high-resolution feature maps in the final mask prediction. These methods provide unified frameworks for predicting things and stuff in panoptic segmentation. However, there is still a noticeable performance gap between these methods and the top leaderboard methods with separate prediction strategies [4, 34].

End-to-end Object Detection.

The recent popularity of end-to-end object detection frameworks has inspired many related works. DETR [3] is arguably the most representative end-to-end object detector among these methods. DETR models the object detection task as a dictionary lookup problem with learnable queries and employs an encoder-decoder transformer to predict bounding boxes without extra post-processing. DETR greatly simplifies the conventional detection framework and removes many hand-crafted components such as NMS [27, 21] and anchors [21]. Zhu et al. [43] proposed Deformable DETR, which further reduces the memory and computational cost of DETR through deformable attention layers. Despite these advantages, the attention maps of the deformable attention layers are sparse and cannot be directly used for dense prediction in panoptic segmentation.

Instance Segmentation.

Mask R-CNN [13] has been one of the most representative two-stage instance segmentation methods, first extracting RoIs and then predicting the final results conditioned on these RoIs. One-stage methods such as CondInst [30] and SOLOv2 [38] further simplify this pipeline by employing dynamic filters (conditional convolution) [40] with a kernel branch. Recently, SOLQ [10] and QueryInst [11] perform instance segmentation in an end-to-end paradigm without involving NMS. QueryInst is based on the end-to-end object detector Sparse R-CNN [29] and predicts masks through corresponding bounding boxes and queries. By encoding masks as vectors, SOLQ predicts mask vectors in a regressive manner and outputs the final masks by decoding the vectors. The proposed Panoptic SegFormer can also handle end-to-end instance segmentation by only predicting thing classes.

3 Methods

3.1 Overall architecture

As illustrated in Figure 2, Panoptic SegFormer consists of three key modules: a transformer encoder, a location decoder, and a mask decoder, where (1) the transformer encoder refines the multi-scale feature maps given by the backbone, (2) the location decoder captures objects’ location clues, and (3) the mask decoder performs the final classification and segmentation.

During the forward phase, we first feed the input image to the backbone network and obtain the feature maps from the last three stages, whose resolutions are 1/8, 1/16, and 1/32 of the input image, respectively. We then project the three feature maps to 256 channels with a fully-connected (FC) layer and flatten them into feature tokens. Next, using the concatenated feature tokens as input, the transformer encoder outputs refined features of the same size. After that, we use randomly initialized queries to uniformly describe things and stuff: the location decoder embeds the location clues (i.e., center location and scale, the size of the mask) into the queries, and the mask decoder predicts the final categories and masks. Finally, we adopt a mask-wise strategy to merge the predicted masks into the panoptic segmentation result, which will be introduced in detail in Section 3.6.
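The flattening step above can be sketched in a few lines of numpy (the helper name and the 256×256 input size are illustrative; each (C, Hi, Wi) map becomes Hi·Wi tokens of dimension C):

```python
import numpy as np

def flatten_multiscale(features):
    # Each (C, Hi, Wi) map becomes Hi*Wi tokens of dimension C;
    # the token sets of all scales are concatenated along the length axis.
    tokens = [f.reshape(f.shape[0], -1).T for f in features]
    return np.concatenate(tokens, axis=0)

# Illustrative 1/8-, 1/16-, and 1/32-resolution maps of a 256x256 input,
# each with 256 channels after the projection layer.
maps = [np.zeros((256, 32, 32)), np.zeros((256, 16, 16)), np.zeros((256, 8, 8))]
tokens = flatten_multiscale(maps)
print(tokens.shape)  # (32*32 + 16*16 + 8*8, 256) = (1344, 256)
```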

3.2 Transformer Encoder

High-resolution, multi-scale feature maps are important for the segmentation task [15, 38, 19]. Due to the high computational cost of the multi-head attention layer, previous transformer-based methods [3, 8] can only process a low-resolution feature map (e.g., the final-stage 1/32-resolution map of ResNet) in their encoders, which limits the segmentation performance.

Different from these methods, we employ the deformable attention layer [43] to implement our transformer encoder. Thanks to the low computational complexity of deformable attention, our encoder can refine high-resolution, multi-scale feature maps and incorporate positional encodings [31] into them.
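To make the complexity gap concrete, one can count the tokens of the multi-scale pyramid: full self-attention over L tokens costs O(L²) interactions, while deformable attention samples a small fixed number K of points per query, i.e., O(LK). A quick back-of-the-envelope check (K = 12 is an illustrative Deformable DETR-style choice of 4 sampling points per level over 3 levels):

```python
# Token count of the 1/8 + 1/16 + 1/32 pyramid for an 800x1333 input
# (the maximum training size quoted in Section 4.2).
L = sum((800 // s) * (1333 // s) for s in (8, 16, 32))
K = 4 * 3  # illustrative: 4 sampling points per query per level, 3 levels
print(L, L * L, L * K)  # full attention grows with L*L, deformable with L*K
```

With roughly 22k tokens, L² is on the order of 5·10⁸ pairwise interactions, while L·K stays below 3·10⁵, which is why the encoder can afford the 1/8-resolution map.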

3.3 Location Decoder

Location information plays an important role in distinguishing things with different instance ids in the panoptic segmentation task [37, 30, 38]. Inspired by this, we design a location decoder to introduce the location information (i.e., center location and scale) of things and stuff into the learnable queries.

Specifically, given randomly initialized queries and the refined feature tokens generated by the transformer encoder, the location decoder outputs location-aware queries. In the training phase, we apply an auxiliary MLP head on top of the location-aware queries to predict the center locations and scales of the target objects, and supervise the prediction with a location loss. Note that the MLP head is an auxiliary branch that can be discarded during the inference phase. Since the location decoder does not need to predict segmentation masks, we implement it with computationally and memory-efficient deformable attention [43].

Figure 3: Architecture of the mask decoder. The attention maps are the product of queries and keys. We split and reshape the multi-scale attention maps, then upsample and concatenate them. The mask is generated from the fused attention maps with one conv layer. The category label is predicted from the refined query with one linear projection layer.

3.4 Mask Decoder

As shown in Figure 3, the mask decoder predicts the object category and mask according to the given queries. The queries of the mask decoder are the location-aware queries from the location decoder, and the keys and values are the refined feature tokens from the transformer encoder. We first pass the queries through 4 decoder layers, and then fetch the attention maps and the refined queries from the last decoder layer.

Similar to previous methods [32, 3], we directly perform classification through an FC layer on top of the refined query from the last decoder layer.

At the same time, to predict the object mask, we first split and reshape the attention maps into three groups A3, A4, and A5, which have the same spatial resolutions as the 1/8-, 1/16-, and 1/32-scale feature maps, respectively. This process can be formulated as:

A3, A4, A5 = Split(A),        (1)

where Split(·) denotes the split-and-reshape operation. After that, we upsample these attention maps to the resolution of A3 and concatenate them along the channel dimension, as illustrated in Eqn. 2:

Afused = Concat(A3, Up×2(A4), Up×4(A5)),        (2)

where Up×2(·) and Up×4(·) denote the 2-times and 4-times bilinear interpolation operations, respectively, and Concat(·) is the concatenation operation. Finally, based on the fused attention maps Afused, we predict the binary mask through a convolution layer.

Note that, because a complete attention map is required to predict segmentation masks, we implement the mask decoder with the common multi-head attention [31], instead of sparse attention layers such as deformable attention [43] and Longformer [1].
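The split-upsample-concatenate step can be sketched for a single query as follows (nearest-neighbor repetition stands in for the paper's bilinear interpolation, and the head count and resolutions are illustrative):

```python
import numpy as np

def fuse_attention_maps(a3, a4, a5):
    # a3: (h, H, W), a4: (h, H/2, W/2), a5: (h, H/4, W/4) per-head attention
    # maps at the 1/8, 1/16, and 1/32 scales for one query. Lower-resolution
    # maps are upsampled (nearest-neighbor here, standing in for bilinear)
    # and all maps are concatenated along the channel/head dimension.
    up2 = lambda x: x.repeat(2, axis=-2).repeat(2, axis=-1)
    a4_up = up2(a4)        # 2x upsample to (h, H, W)
    a5_up = up2(up2(a5))   # 4x upsample to (h, H, W)
    return np.concatenate([a3, a4_up, a5_up], axis=0)  # (3h, H, W)

h, H, W = 8, 32, 32
fused = fuse_attention_maps(np.zeros((h, H, W)),
                            np.zeros((h, H // 2, W // 2)),
                            np.zeros((h, H // 4, W // 4)))
print(fused.shape)  # (24, 32, 32): ready for the final conv layer
```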

3.5 Loss Function

During training, we follow common practices [3, 28] to search for the best bipartite matching between the prediction set and the ground truth set. Since the number of predictions is guaranteed to be no smaller than the number of ground truths, the ground truth set is padded with the "no object" class (∅) so that its element number matches that of the prediction set. Specifically, we utilize the Hungarian algorithm [17] to search for the permutation σ with the minimum matching cost, which is the sum of the classification loss and the segmentation loss.
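The bipartite matching step can be illustrated with a brute-force search over permutations (the Hungarian algorithm computes the same minimum-cost assignment in polynomial time; the cost values below are illustrative stand-ins for the classification-plus-segmentation cost):

```python
import itertools
import numpy as np

def best_matching(cost):
    # Exhaustively find the minimum-cost assignment of ground truths to
    # predictions. cost[p, g] is the cost of matching prediction p to
    # ground truth g. Brute force is shown only for clarity; the Hungarian
    # algorithm solves the same problem efficiently.
    n_pred, n_gt = cost.shape
    best, best_perm = float("inf"), None
    for perm in itertools.permutations(range(n_pred), n_gt):
        c = sum(cost[p, g] for g, p in enumerate(perm))
        if c < best:
            best, best_perm = c, perm
    return best_perm, best

cost = np.array([[0.9, 0.2],
                 [0.1, 0.8],
                 [0.5, 0.5]])  # 3 predictions, 2 ground truths
perm, total = best_matching(cost)
print(perm, total)  # gt 0 is matched to prediction 1, gt 1 to prediction 0
```

Predictions left unmatched (here, prediction 2) are assigned to the padded "no object" class.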

The overall loss function of Panoptic SegFormer can be written as:

L = λcls·Lcls + λseg·Lseg + λloc·Lloc,        (3)

where λcls, λseg, and λloc are the weights balancing the three losses. Lcls is the classification loss, implemented with Focal loss [21], and Lseg is the segmentation loss, implemented with Dice loss [24]. Lloc is the location loss, as formulated in Eqn. 4:

Lloc = Σ_{i: y_i ≠ ∅} ( ||p̂_σ(i) − p_i||_1 + ||ŝ_σ(i) − s_i||_1 ),        (4)

where || · ||_1 is the L1 loss, p̂_σ(i) and ŝ_σ(i) are the predicted center point and scale from the location decoder, σ(i) denotes the index in the permutation σ, and p_i and s_i indicate the center location and scale (size of the mask normalized by the size of the image) of the target mask y_i. The condition y_i ≠ ∅ indicates that only pairs containing a real ground truth are taken into account.
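For reference, the Dice loss used for the segmentation term has a simple closed form; a minimal numpy sketch (the smoothing constant eps is a common implementation detail, not specified in the text):

```python
import numpy as np

def dice_loss(pred, target, eps=1.0):
    # Dice loss between a soft predicted mask and a binary target mask.
    # The loss is 0 for a perfect prediction and approaches 1 as the
    # overlap between prediction and target vanishes.
    pred, target = pred.ravel(), target.ravel()
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

target = np.zeros((4, 4))
target[:2] = 1.0  # toy ground-truth mask covering the top half
print(round(dice_loss(target, target), 4))  # perfect match -> 0.0
```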

3.6 Mask-Wise Inference

Panoptic segmentation requires each pixel to be assigned a category label (or void) and an instance id (the id is ignored for stuff) [16]. One commonly used post-processing method is the heuristic procedure [16], which adopts an NMS-like procedure to generate non-overlapping instance segments for things; we call it the mask-wise strategy here. The heuristic procedure also uses a pixel-wise argmax strategy for stuff and resolves overlaps between things and stuff in favor of the thing classes. Recent methods [8, 3, 32] use the pixel-wise strategy to uniformly merge the results of things and stuff. Although the pixel-wise argmax strategy is conceptually simple, we observe that it consistently produces noisy results due to abnormally extreme pixel values. To this end, we adopt the mask-wise strategy, rather than the pixel-wise one, to generate non-overlapping results for stuff as well, based on the heuristic procedure. However, unlike [16], we treat things and stuff equally and resolve the overlaps among all masks by their confidence scores instead of favoring things over stuff.

As illustrated in Algorithm 1, the mask-wise merging strategy takes the predicted categories c, confidence scores s, and segmentation masks m as input, and outputs a semantic mask SemMsk and an instance id mask IdMsk, which assign a category label and an instance id to each pixel.

def MaskWiseMerging(c, s, m):
    # c: predicted categories
    # s: confidence scores
    # m: binary masks of shape (N, H, W)
    SemMsk = np.zeros((H, W))
    IdMsk = np.zeros((H, W))
    id = 1  # instance ids start at 1; 0 means unassigned
    for i in np.argsort(-s):  # descending order of confidence
        # drop low-quality results
        if s[i] < thr:
            continue
        # drop overlaps with already-filled (higher-confidence) regions
        keep = m[i] & (SemMsk == 0)
        SemMsk[keep] = c[i]
        if isThing(c[i]):
            IdMsk[keep] = id
            id += 1
    return SemMsk, IdMsk
Algorithm 1 Mask-Wise Merging

Specifically, SemMsk and IdMsk are first initialized with zeros. Then, we sort the prediction results in descending order of confidence score and fill the sorted predicted masks into SemMsk and IdMsk. Note that results with confidence scores below the threshold thr are discarded, and overlapping regions are kept only by the mask with the higher confidence score, producing non-overlapping panoptic results. In the end, each pixel is assigned a category label and, for things only, an instance id.
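Algorithm 1 can be checked end-to-end on a toy input; the sketch below is a self-contained numpy version (the thing/stuff split of the class ids and the helper names are illustrative):

```python
import numpy as np

def mask_wise_merging(c, s, m, is_thing, thr=0.3):
    # Merge per-query predictions into a panoptic result (Algorithm 1).
    # c: (N,) categories, s: (N,) scores, m: (N, H, W) boolean masks.
    H, W = m.shape[1:]
    sem = np.zeros((H, W), dtype=int)
    ids = np.zeros((H, W), dtype=int)
    next_id = 1  # 0 means "no instance"
    for i in np.argsort(-s):          # descending confidence
        if s[i] < thr:                # drop low-quality results
            continue
        keep = m[i] & (sem == 0)      # higher-confidence masks win overlaps
        sem[keep] = c[i]
        if is_thing(c[i]):
            ids[keep] = next_id
            next_id += 1
    return sem, ids

masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :, :2] = True                # thing mask covering the left half
masks[1] = True                       # stuff mask covering the whole image
sem, ids = mask_wise_merging(np.array([1, 2]), np.array([0.9, 0.6]),
                             masks, is_thing=lambda c: c == 1)
print(sem[0, 0], sem[0, 3], ids[0, 0], ids[0, 3])  # 1 2 1 0
```

The higher-confidence thing mask claims the left half; the stuff mask keeps only the pixels the thing did not claim.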

Method | Backbone | Epochs | PQ | PQ(th) | PQ(st) | #Param | FLOPs
Panoptic FPN [15] | R50-FPN [14, 20] | 36 | 41.5 | 48.5 | 31.1 | - | -
SOLOv2 [38] | R50-FPN | 36 | 42.1 | 49.6 | 30.7 | - | -
DETR [3] | R50 | - | 43.4 | 48.2 | 36.3 | 42.8M | 137G
Panoptic FCN [19] | R50-FPN | 36 | 43.6 | 49.3 | 35.0 | 37.0M | 244G
K-Net [42] | R50-FPN | 36 | 45.1 | 50.3 | 37.3 | - | -
MaskFormer [8] | R50 | 300 | 46.5 | 51.0 | 39.8 | 45.0M | 181G
DETR [3] | R101 | - | 45.1 | 50.5 | 37.0 | 61.8M | 157G
Max-Deeplab-S [32] | Max-S | 54 | 48.4 | 53.0 | 41.5 | 61.9M | 162G
MaskFormer [8] | R101 | 300 | 47.6 | 52.5 | 40.3 | 64.0M | 248G
Max-Deeplab-L [32] | Max-L | 54 | 51.1 | 57.0 | 42.2 | 451.0M | 1846G
MaskFormer [8] | Swin-L [23] | 300 | 52.7 | 58.5 | 44.0 | 212.0M | 792G
Panoptic SegFormer | R50 | 12 | 46.4 | 52.6 | 37.0 | 47.0M | 246G
Panoptic SegFormer | R50 | 50 | 50.0 | 56.1 | 40.8 | 47.0M | 246G
Panoptic SegFormer | R101 | 50 | 50.4 | 56.3 | 41.6 | 65.9M | 322G
Panoptic SegFormer | PVTv2-B0 [35] | 50 | 49.6 | 55.5 | 40.6 | 22.2M | 156G
Panoptic SegFormer | PVTv2-B2 [35] | 50 | 52.6 | 58.7 | 43.3 | 41.6M | 219G
Panoptic SegFormer | PVTv2-B5 [35] | 50 | 54.1 | 60.4 | 44.6 | 100.9M | 391G
Table 1: Experiments on the COCO val set. Panoptic SegFormer achieves 50.0% PQ on COCO val with ResNet-50 as the backbone, surpassing previous methods such as DETR [3] and Panoptic FCN [19] by 6.6% PQ and 6.4% PQ, respectively. When trained for only 12 epochs, Panoptic SegFormer achieves 46.4% PQ, comparable with the 46.5% PQ of MaskFormer [8] trained for 300 epochs.

Note: the Swin-L backbone is pre-trained on ImageNet-22K.

Method | Backbone | Epochs | PQ | PQ(th) | PQ(st) | #Param | FLOPs
Panoptic FPN [15] | R101-FPN | 36 | 43.5 | 50.8 | 32.5 | - | -
DETR [3] | R101 | - | 46.0 | - | - | 61.8M | 157G
Panoptic FCN [19] | R101-FPN | 36 | 45.5 | 51.4 | 36.4 | 56.0M | 310G
K-Net [42] | R101-FPN | 36 | 47.0 | 52.8 | 38.2 | - | -
Max-Deeplab-S [32] | Max-S [32] | 54 | 49.0 | 54.0 | 41.6 | 61.9M | 162G
K-Net [42] | Swin-L | 36 | 52.1 | 58.2 | 42.8 | - | -
Max-Deeplab-L [32] | Max-L [32] | 54 | 51.3 | 57.2 | 42.4 | 451.0M | 1846G
Innovation [4] | ensemble | - | 53.5 | 61.8 | 41.1 | - | -
Panoptic SegFormer | R50 | 50 | 50.0 | 56.2 | 40.8 | 47.0M | 246G
Panoptic SegFormer | R101 | 50 | 50.9 | 57.1 | 41.4 | 65.9M | 322G
Panoptic SegFormer | PVTv2-B5 [35] | 50 | 54.4 | 61.1 | 44.3 | 100.9M | 391G
Table 2: Experiments on the COCO test-dev set. With PVTv2-B5 [35] as the backbone, Panoptic SegFormer achieves 54.4% PQ on COCO test-dev, surpassing the previous SOTA method Max-Deeplab-L [32] and the competition-level method Innovation [4] by 3.1% PQ and 0.9% PQ, respectively, with fewer parameters and lower computational cost.
Method | Backbone | Epochs | AP | AP_S | AP_M | AP_L
Mask R-CNN [13] | R50-FPN | 36 | 37.5 | 21.1 | 39.6 | 48.3
SOLOv2 [38] | R50-FPN | 36 | 38.8 | 16.5 | 41.7 | 56.2
SOLQ (300 queries) [10] | R50 | 50 | 39.7 | 21.5 | 42.5 | 53.1
HTC [5] | R50-FPN | 36 | 40.1 | 23.3 | 42.1 | 52.0
QueryInst (300 queries) [11] | R50-FPN | 36 | 40.6 | 23.4 | 42.5 | 52.8
Panoptic SegFormer (300 queries) | R50 | 50 | 41.7 | 21.9 | 45.3 | 56.3

Table 3: Instance segmentation experiments on the COCO test-dev set. When trained with things only, Panoptic SegFormer can perform instance segmentation. With ResNet-50 as the backbone, Panoptic SegFormer achieves 41.7 mask AP on COCO test-dev, which is 1.6 AP higher than HTC [5].
Method | Backbone | #Param | FLOPs | FPS | Memory
Deformable DETR* [6, 43] | R50 | 39.8M | 195G | 15 | 4567M
Panoptic SegFormer | R50 | 47.0M | 246G | 13 | 7722M
Panoptic SegFormer | R101 | 65.9M | 322G | 10 | 8396M
Panoptic SegFormer | PVTv2-B5 | 100.9M | 391G | 5 | 23112M

Table 4: Deformable DETR is implemented in MMDetection [6], and we use the same encoder. All data are measured on the same platform. FLOPs are computed on input images of a fixed size. Frames-per-second (FPS) is measured on a Tesla V100 GPU with a batch size of 1, averaging runtime over the entire val set. Memory consumption is measured during the training phase with a batch size of 1.

4 Experiments

Figure 4: Visualization results of Panoptic SegFormer (50.4% PQ) compared with DETR [3] (45.1% PQ), MaskFormer [8] (47.6% PQ), and the ground truth on the COCO val set; the leftmost column shows the original images. For a fair comparison, all results are generated with ResNet-101 [14] as the backbone. The second and fourth rows show that our method still performs well in highly crowded or occluded scenes. Benefiting from our mask-wise inference strategy, our results exhibit few of the artifacts that often appear in the results of DETR [3] (e.g., the dining table in the third row).

We evaluate Panoptic SegFormer on COCO [22], comparing it with several state-of-the-art methods. We provide the main results of panoptic segmentation and some visualization results. We also report the results of instance segmentation.

4.1 Datasets

We perform experiments on the COCO 2017 dataset [22] without external data. The COCO dataset contains 118K training images and 5K validation images, covering 80 thing classes and 53 stuff classes.

4.2 Implementation Details

Our settings mainly follow DETR and Deformable DETR for simplicity. Specifically, we use a Channel Mapper [6] to map the dimensions of the backbone's outputs to 256. The location decoder contains 6 deformable attention layers, and the mask decoder contains 4 vanilla cross-attention layers. The hyper-parameters of deformable attention are the same as in Deformable DETR [43]. We train our models for 50 epochs with a batch size of 1 per GPU; the learning rate is decayed at the 40th epoch by a factor of 0.1, and the learning rate multiplier of the backbone is 0.1. We use a multi-scale training strategy with the maximum image side not exceeding 1333 and the minimum image side varying from 480 to 800. The number of queries is set to 400. The weights λcls, λseg, and λloc in Equation 3 are set to 1, 1, and 5, respectively. We employ a threshold of 0.5 to obtain binary masks from soft masks, and the threshold thr used to filter low-quality results is 0.3. The PVTv2 backbone [35] is pre-trained on ImageNet-1K [9]. All experiments are trained on one NVIDIA DGX node with 8 Tesla V100 GPUs. For our largest model, Panoptic SegFormer (PVTv2-B5), we use 4 DGX nodes to shorten the training time.

Panoptic segmentation.

We conduct experiments on the COCO val and test-dev sets. In Tables 1 and 2, we report our main results, compared with other state-of-the-art methods. Panoptic SegFormer attains 50.0% PQ on COCO val with ResNet-50 as the backbone and single-scale input, surpassing the previous methods Panoptic FCN [19] and DETR [3] by 6.4% PQ and 6.6% PQ, respectively. Besides its strong accuracy, Panoptic SegFormer also trains efficiently: with a short 12-epoch schedule and ResNet-50 as the backbone, it achieves 46.4% PQ, on par with the 46.5% PQ of MaskFormer [8] trained for 300 epochs. Enhanced by the powerful vision transformer backbone PVTv2-B5 [35], Panoptic SegFormer attains a new record of 54.4% PQ on COCO test-dev without test-time augmentation, surpassing Max-Deeplab [32] by 3.1% PQ. Our method even surpasses the previous competition-level method Innovation [4] by 0.9% PQ (we only compare methods and results that do not use external data). Figure 4 shows some visualization results on the COCO val set. Even in highly crowded or occluded scenes, Panoptic SegFormer still predicts convincing results.

Instance segmentation.

In Table 3, we report our instance segmentation results on the COCO test-dev set. For a fair comparison, we use 300 queries for instance segmentation, and only the things data is used. With ResNet-50 as the backbone and single-scale input, Panoptic SegFormer achieves 41.7 mask AP, surpassing the previous state-of-the-art methods HTC [5] and QueryInst [11] by 1.6 AP and 1.1 AP, respectively.

Figure 5: Visualization of multi-head attention maps and the corresponding outputs from the mask decoder. Different heads have different preferences: Head 4 and Head 1 pay attention to foreground regions, Head 8 prefers regions that occlude the foreground, and Head 5 always attends to the background around the foreground. Through the collaboration of these heads, Panoptic SegFormer predicts accurate masks. The third row shows an impressive result for a horse that is heavily occluded by another horse.

Visualization of attention maps.

Different from previous methods, our results are generated through multi-scale, multi-head attention maps. Figure 5 shows some samples of multi-head attention maps. Through the multi-head attention mechanism, different heads of one query learn their own attention preferences. We observe that some heads pay attention to foreground regions, some heads prefer boundaries, and others prefer background regions. This shows that each mask is generated by aggregating complementary information from across the image.

4.3 Complexity of Panoptic SegFormer

We show model complexity and inference efficiency in Table 4, and we can see that Panoptic SegFormer can achieve state-of-the-art performance on panoptic segmentation with acceptable inference speed.

5 Conclusion

We propose a concise model named Panoptic SegFormer that unifies the processing workflow of things and stuff. Panoptic SegFormer surpasses previous methods by a large margin, demonstrating the superiority of treating things and stuff with the same recipe.


  • [1] I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv:2004.05150. Cited by: §3.4.
  • [2] U. Bonde, P. F. Alcantarilla, and S. Leutenegger (2020) Towards bounding-box free panoptic segmentation. In DAGM German Conference on Pattern Recognition, Cited by: §1.
  • [3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In ECCV, Cited by: Figure 1, §1, §2, §2, §3.2, §3.4, §3.5, §3.6, Table 1, Table 2, Figure 4, §4.2.
  • [4] C. Chen, J. Ren, D. Jin, Z. Cai, C. Yu, B. Wang, M. Zhang, and J. Wu (2019) Joint COCO and Mapillary workshop at ICCV 2019: COCO panoptic segmentation challenge track technical report: panoptic HTC with class-guided fusion. Cited by: §1, §2, Table 2, §4.2.
  • [5] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al. (2019) Hybrid task cascade for instance segmentation. In CVPR, Cited by: Table 3, §4.2.
  • [6] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) MMDetection: open MMLab detection toolbox and benchmark. arXiv:1906.07155. Cited by: Table 4, §4.2.
  • [7] B. Cheng, M. D. Collins, Y. Zhu, T. Liu, T. S. Huang, H. Adam, and L. Chen (2020) Panoptic-deeplab: a simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, Cited by: §1.
  • [8] B. Cheng, A. G. Schwing, and A. Kirillov (2021) Per-pixel classification is not all you need for semantic segmentation. arXiv:2107.06278. Cited by: Figure 1, §1, §1, §2, §3.2, §3.6, Table 1, Figure 4, §4.2.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §4.2.
  • [10] B. Dong, F. Zeng, T. Wang, X. Zhang, and Y. Wei (2021) SOLQ: segmenting objects by learning queries. arXiv:2106.02351. Cited by: §2, Table 3.
  • [11] Y. Fang, S. Yang, X. Wang, Y. Li, C. Fang, Y. Shan, B. Feng, and W. Liu (2021) QueryInst: parallelly supervised mask query for instance segmentation. arXiv:2105.01928. Cited by: §2, Table 3, §4.2.
  • [12] N. Gao, Y. Shan, Y. Wang, X. Zhao, Y. Yu, M. Yang, and K. Huang (2019) SSAP: single-shot instance segmentation with affinity pyramid. In ICCV, Cited by: §1.
  • [13] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In ICCV, Cited by: §2, Table 3.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §2, Table 1, Figure 4.
  • [15] A. Kirillov, R. Girshick, K. He, and P. Dollár (2019) Panoptic feature pyramid networks. In CVPR, Cited by: §1, §2, §3.2, Table 1, Table 2.
  • [16] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár (2019) Panoptic segmentation. In CVPR, Cited by: §1, §1, §2, §3.6.
  • [17] H. W. Kuhn (1955) The hungarian method for the assignment problem. Naval research logistics quarterly 2 (1-2), pp. 83–97. Cited by: §3.5.
  • [18] Y. Li, X. Chen, Z. Zhu, L. Xie, G. Huang, D. Du, and X. Wang (2019) Attention-guided unified network for panoptic segmentation. In CVPR, Cited by: §2.
  • [19] Y. Li, H. Zhao, X. Qi, L. Wang, Z. Li, J. Sun, and J. Jia (2021) Fully convolutional networks for panoptic segmentation. In CVPR, Cited by: §1, §1, §2, §3.2, Table 1, Table 2, §4.2.
  • [20] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, Cited by: Table 1.
  • [21] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, Cited by: §2, §3.5.
  • [22] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: §1, §4.1, §4.
  • [23] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. ICCV. Cited by: §1, Table 1.
  • [24] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In International Conference on 3D Vision (3DV), Cited by: §3.5.
  • [25] S. Qiao, L. Chen, and A. Yuille (2021) Detectors: detecting objects with recursive feature pyramid and switchable atrous convolution. In CVPR, Cited by: §1.
  • [26] J. Ren, C. Yu, Z. Cai, M. Zhang, C. Chen, H. Zhao, S. Yi, and H. Li (2021) REFINE: prediction fusion network for panoptic segmentation. In AAAI, Cited by: §1.
  • [27] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. NIPS. Cited by: §2.
  • [28] R. Stewart, M. Andriluka, and A. Y. Ng (2016) End-to-end people detection in crowded scenes. In CVPR, Cited by: §3.5.
  • [29] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang, et al. (2021) Sparse R-CNN: end-to-end object detection with learnable proposals. In CVPR, Cited by: §2.
  • [30] Z. Tian, C. Shen, and H. Chen (2020) Conditional convolutions for instance segmentation. In ECCV, Cited by: §1, §2, §2, §3.3.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §2, §3.2, §3.4.
  • [32] H. Wang, Y. Zhu, H. Adam, A. Yuille, and L. Chen (2021) Max-deeplab: end-to-end panoptic segmentation with mask transformers. In CVPR, Cited by: Figure 1, §1, §1, §2, §3.4, §3.6, Table 1, Table 2, §4.2.
  • [33] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L. Chen (2020) Axial-deeplab: stand-alone axial-attention for panoptic segmentation. In ECCV, Cited by: §1.
  • [34] S. Wang, T. Liu, H. Liu, Y. Ma, Z. Li, Z. Wang, X. Zhou, G. Yu, E. Zhou, X. Zhang, et al. (2019) Joint coco and mapillary workshop at iccv 2019: panoptic segmentation challenge track technical report: explore context relation for panoptic segmentation. In ICCV Workshop, Cited by: §2.
  • [35] W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao (2021) PVTv2: improved baselines with pyramid vision transformer. arXiv:2106.13797. Cited by: Table 1, Table 2, §4.2, §4.2.
  • [36] W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao (2021) Pyramid Vision Transformer: a versatile backbone for dense prediction without convolutions. In ICCV, Cited by: §1.
  • [37] X. Wang, T. Kong, C. Shen, Y. Jiang, and L. Li (2020) Solo: segmenting objects by locations. In ECCV, Cited by: §3.3.
  • [38] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen (2020) SOLOv2: dynamic and fast instance segmentation. NeurIPS. Cited by: §1, §2, §3.2, §3.3, Table 1, Table 3.
  • [39] Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer, and R. Urtasun (2019) Upsnet: a unified panoptic segmentation network. In CVPR, Cited by: §1, §2.
  • [40] B. Yang, G. Bender, Q. V. Le, and J. Ngiam (2019) CondConv: conditionally parameterized convolutions for efficient inference. In NeurIPS, Cited by: §2.
  • [41] T. Yang, M. D. Collins, Y. Zhu, J. Hwang, T. Liu, X. Zhang, V. Sze, G. Papandreou, and L. Chen (2019) Deeperlab: single-shot image parser. arXiv:1902.05093. Cited by: §1.
  • [42] W. Zhang, J. Pang, K. Chen, and C. C. Loy (2021) K-net: towards unified image segmentation. arXiv:2106.14855. Cited by: §1, Table 1, Table 2.
  • [43] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai (2020) Deformable detr: deformable transformers for end-to-end object detection. In ICLR, Cited by: §1, §2, §3.2, §3.3, §3.4, Table 4, §4.2.