The goal of semantic segmentation is to assign each pixel of an image a semantic class label. Over the past decade, encoder-decoder based methods have been the mainstream models for this task. They usually use convolution- or transformer-based networks to produce dense predictions from deep features generated by encoder networks. Efforts have been made to design either stronger backbone encoders [25, 39, 10, 12] or decoders [3, 50, 49, 47, 41, 5] for various downstream dense prediction tasks.
In general, the above-mentioned semantic segmentation methods follow a per-pixel classification formulation: the outputs of the models are spatial segmentation maps with a categorical assignment for each pixel. Recently, Cheng et al. proposed MaskFormer, which advocates a pathbreaking replacement for the per-pixel formulation. It adopts a novel per-mask classification paradigm for semantic and panoptic segmentation: instead of directly producing segmentation maps, the objective is to predict a set of paired probabilities and masks corresponding to category segments in an input image. To this end, MaskFormer uses a transformer decoder to obtain probability-mask pairs from a set of learnable object queries.
Despite the state-of-the-art performance of MaskFormer in semantic segmentation, we empirically find it is inclined to produce incorrect probability-mask pairs which degrade the quality of the prediction results. Specifically, we observe two prominent types of mistakes: (1) a query can produce a groundtruth look-alike mask but predict a wrong probability (see Fig. 1(a)), and (2) it can assign a high probability to a class present in the groundtruth but yield a corresponding mask of low quality (see Fig. 1(b)). Since MaskFormer decodes solely on the lowest-resolution feature map (1/32 of the input image size), such unsatisfactory results might be caused by insufficient semantic information in the single-scale feature. Intuitively, predicting object masks requires features containing dense information and rich spatial context, while predicting probabilities needs abstract categorical information. Due to the large variations of semantic information among features of different resolutions [21, 17], it is beneficial to take advantage of a feature pyramid with multi-scale information to generate more accurate probability-mask pairs.
Multi-scale feature maps have been widely studied for many dense vision tasks such as detection and segmentation. [21, 35, 24] showed that multi-scale representations are beneficial for convolution-based object detection. Similarly, multi-scale representations have also been widely explored in convolution-based segmentation frameworks [23, 32]. However, simply extending these multi-scale techniques to transformer-based segmentation frameworks is non-trivial. A naive solution is to perform self-attention on the patch tokens from the feature maps of all scales. However, such an approach is too computationally demanding, as the computational cost of self-attention is quadratic in the number of patch tokens of all scales. Existing works [20, 50] utilize a sparse attention mechanism to reduce the high computational cost, where each query only attends to a sparse set of features for capturing image context. We argue that such sparse attention is sub-optimal for dense pixel prediction tasks, as they need to consider all pixels' information for generating accurate dense predictions.
We propose a multi-scale transformer with cross-scale inter-query attention to effectively aggregate multi-scale information for segmentation. Specifically, we adopt a transformer decoder on the feature pyramid to produce segmentation predictions. As the semantic information at different scales can better capture different objects or structures, it is also important to communicate information across different feature scales [21, 17, 50, 23, 34]. We therefore perform cross-scale communication by introducing a cross-scale inter-query attention mechanism, which uses queries as bridges for information summarization and communication. Such a mechanism also allows queries of different scales to be informed of semantic information from other scales without directly computing on them. The final prediction is the average of the per-scale predictions in the logit space. With the above ingredients, we name our solution Pyramid Fusion Transformer (PFT), a simple yet effective transformer-based segmentation framework that efficaciously reasons from multi-scale feature maps with enhanced latent representations and consolidates predictions with high fidelity.
To demonstrate the effectiveness of our method, we conduct extensive experiments and ablation studies on three widely-used datasets and show state-of-the-art results. As shown in Tab. 1, with a Swin-B backbone, our model achieves 54.1 mIoU on the ADE20K dataset, matching the performance of MaskFormer with a much larger Swin-L backbone. When using a Swin-L backbone, our model achieves 56.0 mIoU with single-scale inference, which is competitive with MaskFormer's performance under multi-scale inference. Furthermore, the same model obtains 57.2 multi-scale mIoU on ADE20K, achieving a state-of-the-art result on the dataset.
2 Related Works
Vision Transformer: inspired by the success of transformers and attention mechanisms in NLP tasks, the vision community has enjoyed a recent surge of interest in adapting transformer structures to various vision tasks [10, 15, 36, 3, 45]. The pioneering work first proposed to "patchify" an image input into an unordered set of tokens, and send them through a series of transformer layers consisting of self-attention modules and fully-connected layers with skip connections. On the task of image classification, it achieves results competitive with popular CNN networks such as ResNets. To better adapt the transformer structure to dense vision tasks, researchers have designed various variants with stage-by-stage feature maps of shrinking resolutions [25, 39, 40, 9, 44]. Swin Transformer is arguably the most representative of these transformers with feature pyramids. Instead of performing self-attention on the entire set of patch tokens, it uses window partitions to constrain where the attention mechanism is applied and allows communication among windows through a shifted configuration. As a result, it achieves high performance on several downstream vision tasks with reduced computational cost and memory consumption.
Per-mask classification segmentation: MaskFormer and Max-DeepLab are among the pioneering works to use a mask classification paradigm in place of the per-pixel classification segmentation approach. Contrary to Fully Convolutional Networks (FCNs), which predict segmentation maps with per-pixel labeling, they produce paired results of masks and their corresponding class labels. In particular, Max-DeepLab adds a transformer module to each backbone convolutional block and performs self-/cross-attention to allow communication between the backbone and the decode head. MaskFormer outputs probability logits for classes and yields mask embeddings for mask generation with a transformer on top of the lowest-resolution features. Recently, Li et al. proposed Panoptic Segformer, which shares a similar per-mask classification concept for semantic and panoptic segmentation. Specifically, it derives the predicted masks from the attention weights between location-aware queries and spatial features, and predicts probabilities in a separate branch. During inference, the masks and probabilities are combined to generate the segmentation maps.
3 Method
Our overall pipeline is shown in Fig. 2. Our proposed Pyramid Fusion Transformer (PFT) takes a feature pyramid encoded by a backbone network as input and adopts a novel multi-scale transformer decoder with a cross-scale inter-query attention mechanism to efficiently fuse multi-scale information for accurate semantic segmentation.
The backbone network, which can be either a convolutional or a transformer network, receives an input image and produces a hierarchy of feature maps through a Feature Pyramid Network (FPN) with a uniform channel dimension. We flatten these maps into sequences of pixel tokens, one sequence per scale. Our PFT is applied on top of these flattened sequences obtained from the multi-scale features. Each scale has a separate set of queries to estimate the confidences and locations of the semantic categories, where each query is only responsible for capturing semantic information of one assigned category. Within the transformer, we recurrently stack three types of attention layers: (1) an intra-scale query self-attention layer that conducts conventional self-attention between queries within the same scale, (2) a novel cross-scale inter-query attention layer that efficiently communicates scale-aware information using the limited number of queries of the different scales, and (3) an intra-scale query-pixel cross-attention layer that aggregates semantic information from the flattened sequences of pixel tokens.
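The flattening step can be sketched as follows (a minimal numpy illustration; the channel dimension, pyramid levels, and input size are toy assumptions, not the paper's settings):

```python
import numpy as np

def flatten_pyramid(features):
    """Flatten each FPN feature map of shape (C, H_s, W_s) into a sequence
    of pixel tokens of shape (H_s * W_s, C); one sequence per scale."""
    return [f.reshape(f.shape[0], -1).T for f in features]

C = 16  # uniform FPN channel dimension (toy value)
# a toy 4-level pyramid, e.g. strides 32/16/8/4 of a 128x128 input
pyramid = [np.zeros((C, s, s)) for s in (4, 8, 16, 32)]
tokens = flatten_pyramid(pyramid)
# token sequence lengths per scale: 16, 64, 256, 1024
```

Each per-scale token sequence is then consumed by that scale's set of category queries.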
3.1 Pyramid Fusion Transformer with Cross-scale Inter-query Attention
Multi-scale information is important for achieving accurate scene understanding. Low-resolution feature maps are able to capture global context, while high-resolution ones are better at discovering fine structures such as category boundaries [46, 17, 43, 31]. It is therefore vital to propagate information across the multiple scales to capture both global and fine-grained information. However, due to the high computational cost of directly applying attention on the large number of multi-scale pixel tokens, previous transformer-based semantic segmentation methods such as [49, 20] often rely on a sparse attention mechanism over the pixel tokens to model pixel-to-pixel relations. Contrary to their approaches, to avoid heavy computation, we propose to efficiently fuse the multi-scale information with our cross-scale inter-query attention mechanism. Instead of extracting global and local semantic information among the pixel tokens like the previous sparse attention-based methods, we fuse the multi-scale information in the query embedding space. Three types of attention layers are recurrently stacked in our PFT to achieve this goal.
Intra-scale query self-attention.
Within each scale, the intra-scale query self-attention layer conducts self-attention only between the category queries of that scale. Specifically, the category queries with learnable positional encodings are input into the layer. The category queries are zero-initialized at the first layer and updated by the stacked attention layers, while the learnable positional encodings are shared across different depths. Following [3, 50, 6], the positional encodings are only used for encoding the embeddings for self-attention. The self-attention is conducted only between queries of the same scale to obtain the updated queries. Each intra-scale self-attention layer consists of the commonly used multi-head self-attention sub-layer and a Feed-Forward Network (FFN) sub-layer with layer normalization and residual connections.
Cross-scale inter-query attention.
As the above intra-scale self-attention is limited to each scale, the updated queries can only obtain information specific to their own scale. To allow information propagation between the multiple scales for knowledge fusion, we introduce a novel cross-scale inter-query attention layer. As the number of such scale-aware queries in each scale is much smaller than the number of visual tokens in each scale, we achieve information propagation across the multiple scales by conducting attention between the concatenated category queries of the four scales. The inter-query attention results are then sliced back into the queries of the four scales, which serve as the input for the follow-up intra-scale query-pixel cross-attention layer.
To distinguish queries of different scales, we use the learnable positional encodings from the intra-scale query self-attention layer.
The attention outputs a concatenated sequence of category queries, which is sliced back into per-scale sub-sequences and assigned to the corresponding scales. In this way, cross-scale attention and information communication are efficiently achieved using the small number of category queries. The proposed cross-scale inter-query attention layer also consists of a multi-head attention sub-layer and an FFN sub-layer with layer normalization and residual connections, following the classical design of dot-product attention.
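A rough sketch of this concatenate-attend-slice mechanism (single-head attention, toy sizes, and numpy in place of a real deep-learning stack; the FFN sub-layer and layer normalization are omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_scale_inter_query_attention(queries, pos):
    """Concatenate the per-scale category queries (each (N, C)), run one
    dot-product attention over the concatenation, then slice the result
    back into per-scale sets. `pos` are the learnable per-scale positional
    encodings, reused here to distinguish queries of different scales."""
    x = np.concatenate(queries, axis=0)             # (4N, C)
    p = np.concatenate(pos, axis=0)
    q = k = x + p                                   # positions only affect q/k
    attn = softmax(q @ k.T / np.sqrt(x.shape[-1]))  # (4N, 4N)
    x = x + attn @ x                                # residual connection
    splits = np.cumsum([s.shape[0] for s in queries])[:-1]
    return np.split(x, splits, axis=0)

rng = np.random.default_rng(0)
N, C = 8, 16                                        # toy sizes
queries = [rng.normal(size=(N, C)) for _ in range(4)]
pos = [rng.normal(size=(N, C)) for _ in range(4)]
out = cross_scale_inter_query_attention(queries, pos)
```

The attention here operates on only 4N query tokens, rather than on the full set of multi-scale pixel tokens.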
Intra-scale query-pixel cross-attention.
To aggregate dense pixel-level semantic information into the category queries, we conduct dense query-pixel cross-attention within each scale: the category queries of a scale perform multi-head cross-attention with the pixel tokens of that scale.
The learnable positional encodings are added to the category queries, while fixed sinusoidal positional encodings are added to the pixel tokens, since learning positional encodings for the very large number of pixel tokens would be costly.
As illustrated in Fig. 2, each transformer layer of our pyramid fusion transformer consists of the above three types of attention layers, and the layer is repeated multiple times to form a stacked transformer. For the intra-scale query self-attention and intra-scale query-pixel cross-attention, we use separate weights for the linear projection layers, layer normalization, etc. for each scale. The proposed cross-scale inter-query attention layers are placed between the intra-scale query self-attention layers and the intra-scale query-pixel cross-attention layers. The small number of category queries serve as bridges to efficiently aggregate and propagate the pixel-level semantic information across the multiple scales. Neither intra-scale nor cross-scale pixel-to-pixel attention is used in our PFT, avoiding the heavy computational cost of dense multi-scale information fusion.
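To see why avoiding pixel-to-pixel attention matters, a back-of-the-envelope comparison of attention score computations is instructive (all sizes below are toy assumptions, not the paper's settings):

```python
# multiply counts for building the attention score matrices (toy comparison)
C = 256                                  # channel dimension (assumed)
N = 100                                  # category queries per scale (assumed)
L = [w * w for w in (8, 16, 32, 64)]     # pixel tokens per scale, 256x256 input

pixel_tokens = sum(L)                    # all multi-scale pixel tokens
dense = pixel_tokens ** 2 * C            # joint self-attention over all scales
pft = (4 * N * N                         # intra-scale query self-attention
       + (4 * N) ** 2                    # cross-scale inter-query attention
       + sum(N * l for l in L)) * C      # intra-scale query-pixel cross-attention
ratio = dense / pft                      # dense attention is far more expensive
```

Because the query count N is fixed and small, the dominant term in `pft` grows only linearly with the number of pixel tokens, whereas `dense` grows quadratically.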
Generating Segmentation Maps.
After the multiple transformer layers, the updated four sets of category queries with multi-scale information can be used to generate the probability-mask pairs for segmentation.
To generate the probabilities of the categories for each scale, we use separate linear projections to map the category queries at the output of our PFT to binary probability logits, one per category. The probability logits of the multiple scales are averaged and passed through a sigmoid function to generate the binary probabilities, which denote the confidence of each category being present in the input image.
To generate the binary category masks for the multiple scales, we first transform the spatial feature with four separate sets of convolutions to produce the spatial features for the four scales. Each set of category queries for a scale then goes through a Multi-Layer Perceptron (MLP) and takes a matrix product with its corresponding spatial feature to produce the mask logits. The mask logits of the four scales are averaged and passed through a sigmoid function to obtain the mask probability maps of each category, where the mask logits of all scales share the same spatial resolution relative to the input image.
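The two prediction heads can be sketched as follows (numpy, with random stand-in weights for the per-scale linear projections and MLPs; reducing the MLP to a single linear map and using toy sizes are simplifications):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
K, C, H, W = 5, 16, 8, 8            # categories, channels, mask resolution (toy)

# final category queries from the 4 scales, one query per category
queries = [rng.normal(size=(K, C)) for _ in range(4)]
# per-scale spatial features (from separate convolutions), flattened
feats = [rng.normal(size=(C, H * W)) for _ in range(4)]
# separate per-scale projection / "MLP" weights (random stand-ins)
w_prob = [rng.normal(size=(C, 1)) for _ in range(4)]
w_mask = [rng.normal(size=(C, C)) for _ in range(4)]

# probabilities: per-scale logits, averaged over scales, then sigmoid
prob_logits = np.mean([q @ w for q, w in zip(queries, w_prob)], axis=0)
probs = sigmoid(prob_logits)[:, 0]                 # (K,)

# masks: MLP(queries) @ spatial features, averaged over scales, then sigmoid
mask_logits = np.mean([(q @ w) @ f for q, w, f in zip(queries, w_mask, feats)],
                      axis=0)
masks = sigmoid(mask_logits).reshape(K, H, W)      # (K, H, W)
```

Averaging is done in the logit space, as described above, so the sigmoid is applied only once per output.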
Each category's probability is taken from its corresponding item of the averaged probability vector, and its mask from its corresponding channel of the averaged mask probability maps. The semantic segmentation map is obtained by probability-mask marginalization as in MaskFormer: the category prediction for each pixel is the category that maximizes the product of its predicted probability and its mask probability at that pixel.
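A minimal sketch of this marginalization, assuming the fixed query-category matching described above so that the k-th probability and mask belong to category k:

```python
import numpy as np

def marginalize(probs, masks):
    """Per-pixel category = argmax over categories of
    class probability * mask probability (fixed query-category matching)."""
    scores = probs[:, None, None] * masks      # (K, H, W)
    return scores.argmax(axis=0)               # (H, W) category indices

# toy example: 3 categories on a 2x2 image
probs = np.array([0.9, 0.2, 0.8])
masks = np.zeros((3, 2, 2))
masks[0, 0, :] = 0.9     # category 0 covers the top row
masks[2, 1, :] = 0.9     # category 2 covers the bottom row
seg = marginalize(probs, masks)
# -> top row labeled 0, bottom row labeled 2
```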
3.2 Training losses
To train our model, we adopt a classification loss and a mask loss to optimize the probabilities and masks. To construct the groundtruth for per-mask classification semantic segmentation, we decompose the segmentation map of each image into a set of groundtruth label-mask pairs, where each label is the categorical label for the binary mask covering all pixels belonging to that category. The number of categories present in an image is usually smaller than the total number of categories, so we pad the groundtruth set with "not-exist" entries. During training, the probability-mask pair produced by each query is matched to the groundtruth label-mask pair of its assigned category, or to a "not-exist" entry if that category is absent from the image.
The classification loss consists of two parts: a binary cross-entropy loss applied to the probability logits averaged over the four decoding scales, and a focal-style binary cross-entropy loss applied to the outputs of each decoding scale. If a category is present in the input image, its query is expected to predict a high probability for it; otherwise, the query is supervised to predict the "not-exist" label, which represents the absence of the category. A similar formulation is adopted for the focal-style classification loss at each scale, where we adaptively reweight the hard samples following the focal loss. The detailed formulation of the focal-style binary cross-entropy loss can be found in Sec. A.1. The classification loss is then a linear combination of the above two losses, with two hyperparameters balancing the terms.
To construct our mask loss, we use the same binary focal loss and dice loss as in MaskFormer. Note that the mask loss is only applied to the logits averaged over the four scales and optimizes masks with groundtruth categories only; masks matched to "not-exist" entries are simply discarded during training.
Our final training loss is the sum of the classification loss and the mask loss.
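A hedged sketch of the combined objective (numpy; the balancing weights `lam_*`, the focal exponent `gamma`, and the flattened toy tensors are assumptions, and the matching/"not-exist" handling is omitted for brevity):

```python
import numpy as np

def bce(p, t, eps=1e-6):
    """Binary cross-entropy between probabilities p and targets t."""
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p)).mean()

def focal_bce(p, t, gamma=2.0, eps=1e-6):
    """Focal-style binary cross-entropy: down-weights easy examples."""
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(t == 1, p, 1 - p)
    return -((1 - pt) ** gamma * np.log(pt)).mean()

def dice_loss(p, t, eps=1e-6):
    """Soft dice loss on flattened mask probabilities."""
    inter = (p * t).sum()
    return 1 - (2 * inter + eps) / (p.sum() + t.sum() + eps)

# toy predictions/targets; lam_* are hypothetical balancing weights
rng = np.random.default_rng(0)
probs, prob_gt = rng.uniform(size=5), (rng.uniform(size=5) > 0.5).astype(float)
mask,  mask_gt = rng.uniform(size=64), (rng.uniform(size=64) > 0.5).astype(float)
lam_cls, lam_focal, lam_mask, lam_dice = 1.0, 1.0, 20.0, 1.0
loss = (lam_cls * bce(probs, prob_gt)           # BCE on averaged prob. logits
        + lam_focal * focal_bce(probs, prob_gt) # focal-style CE per scale
        + lam_mask * focal_bce(mask, mask_gt)   # mask focal loss
        + lam_dice * dice_loss(mask, mask_gt))  # mask dice loss
```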
4 Experiments
In this section, we demonstrate the effectiveness of our method with competitive semantic segmentation results and compare with both state-of-the-art per-pixel classification and mask-level classification frameworks on three popular segmentation datasets: ADE20K, COCO-Stuff-10K, and PASCAL-Context. We choose MaskFormer as our baseline model because of its strong performance among the mask-level classification methods [6, 38, 20]. In the ablations, we further study the effectiveness of our proposed components, including the usage of multi-scale features, cross-scale inter-query attention, loss designs, and model parameter sharing. Experimental results demonstrate that our model can learn useful information from multi-scale feature maps to deliver high-quality segmentation maps with our proposed per-scale transformer decoders and cross-scale inter-query attention module.
4.1 Datasets and Implementation details
ADE20K is a semantic segmentation dataset with 150 fine-grained semantic categories, including both thing and stuff classes. It contains 20,210 images for training, 2,000 images for validation, and 3,352 images for testing. COCO-Stuff-10K is a scene parsing dataset with 171 categories, not counting the class "unlabeled". We follow the official split to partition the dataset into 9k images for training and 1k images for validation. PASCAL-Context contains pixel-level annotations for whole scenes, with 4,998 images for training and 5,105 images for testing. We evaluate our method on the commonly used 59 classes of the dataset.
We use the open-source segmentation codebase mmsegmentation to implement PFT. We adopt Swin Transformer and ResNet as backbone networks for evaluation. For ResNets, we report results obtained with ResNet-50 and ResNet-101, along with a slightly modified version, ResNet-101c, whose stem convolution layer is replaced by three consecutive 3x3 convolutions, a protocol widely adopted in semantic segmentation methods [5, 28, 46, 4, 14, 13].
Models are trained on ADE20K, COCO-Stuff-10K, and PASCAL-Context with 160k-, 60k-, and 40k-iteration schedules respectively, unless otherwise specified. For the ADE20K dataset, images are cropped after scale jittering, random horizontal flipping, and color jittering. The same data augmentations are used for the COCO-Stuff-10K and PASCAL-Context datasets, with dataset-specific crop sizes. Scale jittering is set to between 0.5 and 2.0 of the crop sizes.
Following MaskFormer, we use its weights for the focal loss and dice loss. The number of decoder layers differs between ADE20K/COCO-Stuff-10K and PASCAL-Context (see Sec. A.3). We ablate the weight of the focal-style cross-entropy loss as in Fig. 6 and choose the same value for all datasets; for its remaining hyperparameters, we use the defaults. AdamW
is used as our optimizer with a linear learning-rate decay schedule. ResNet backbones are pretrained on ImageNet-1K, and a learning-rate multiplier is applied to the CNN backbones. The Swin Transformer backbones use their own initial learning rate and weight decay. For the Swin-T and Swin-S backbones, we use the official ImageNet-1K pretrained weights; for Swin-B and Swin-L, we use the ImageNet-22K pretrained weights. All models are trained on a single compute node with 8 NVIDIA Tesla V100 GPUs. See Sec. A.3 for more detailed documentation of hyperparameters and experimental settings.
We use mean Intersection-over-Union (mIoU) as the evaluation metric for semantic segmentation performance. Both single-scale and multi-scale inference results are reported in our experiments. For multi-scale inference, we apply horizontal flipping and scales of 0.5, 0.75, 1.0, 1.25, 1.5, and 1.75.
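The multi-scale test-time augmentation can be sketched as follows (numpy, with nearest-neighbour resizing as a stand-in for the bilinear interpolation a real pipeline would use; `predict` is any callable producing per-class score maps):

```python
import numpy as np

def multi_scale_inference(predict, image,
                          scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
    """Average per-class score maps over rescaled and horizontally flipped
    inputs; `predict` maps an (H, W) image to (K, H, W) class scores."""
    H, W = image.shape
    acc = np.zeros_like(predict(image))
    for s in scales:
        h, w = max(1, int(H * s)), max(1, int(W * s))
        # nearest-neighbour resize via index sampling (toy stand-in)
        ys, xs = np.arange(h) * H // h, np.arange(w) * W // w
        im = image[np.ix_(ys, xs)]
        for flipped in (im, im[:, ::-1]):
            out = predict(flipped)            # (K, h, w)
            if flipped is not im:
                out = out[:, :, ::-1]         # flip the scores back
            # resize scores back to (K, H, W) and accumulate
            ys2, xs2 = np.arange(H) * h // H, np.arange(W) * w // W
            acc += out[:, ys2][:, :, xs2]
    return acc / (2 * len(scales))

# a toy "model": 3-class scores derived directly from pixel intensity
toy = lambda im: np.stack([im, 1 - im, 0.5 * im])
scores = multi_scale_inference(toy, np.ones((8, 8)) * 0.5)
```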
4.2 Main results
| method | backbone | pretraining | batch size | schedule | mIoU (s.s.) | mIoU (m.s.) | #params |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PFT (ours) | R50 | IM-1K | 16 | 160k | 45.6 (+1.1) | 48.3 (+1.6) | 74M |
| PFT (ours) | R101 | IM-1K | 16 | 160k | 47.2 (+1.7) | 49.4 (+2.2) | 93M |
| PFT (ours) | R101c | IM-1K | 16 | 160k | 47.9 (+1.9) | 49.4 (+1.3) | 93M |
| PFT (ours) | Swin-T | IM-1K | 16 | 160k | 48.3 (+1.6) | 49.6 (+0.8) | 74M |
| PFT (ours) | Swin-S | IM-1K | 16 | 160k | 51.0 (+1.2) | 52.2 (+1.2) | 96M |
| PFT (ours) | Swin-B | IM-22K | 16 | 160k | 54.1 (+1.4) | 55.3 (+1.4) | 133M |
| PFT (ours) | Swin-L | IM-22K | 16 | 160k | 56.0 (+1.9) | 57.2 (+1.6) | 242M |
Results on ADE20K dataset.
Tab. 1 summarizes our results on the ADE20K validation set. We report results from both single-scale and multi-scale inference. As shown in the table, when paired with the same ResNet-50 backbone, PFT achieves 45.6 mIoU, improving over MaskFormer by 1.1 mIoU and matching the accuracy MaskFormer obtains with a ResNet-101 backbone. With our largest CNN backbone, ResNet-101c, and single-scale input, we achieve 47.9 mIoU, a 1.9 mIoU improvement over MaskFormer with the same encoder. The results with transformer backbones are consistent with those with CNNs, surpassing MaskFormer with the same backbones and pretraining. Notably, with a Swin-B backbone, we obtain an mIoU of 54.1, matching the result of MaskFormer with a much larger Swin-L backbone. Our best model achieves a 56.0 single-scale mIoU and 57.2 multi-scale mIoU, surpassing BEiT with its more sophisticated pretraining scheme and achieving a state-of-the-art result on the dataset.
| method | backbone | pretraining | batch size | schedule | mIoU (s.s.) | mIoU (m.s.) | #params |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PFT (ours) | R50 | IM-1K | 16 | 60k | 38.4 (+1.3) | 40.3 (+1.4) | 74M |
| PFT (ours) | R50c | IM-1K | 16 | 60k | 39.5 (+1.8) | 41.0 (+2.9) | 74M |
| PFT (ours) | R101 | IM-1K | 16 | 60k | 40.9 (+1.8) | 42.1 (+2.3) | 93M |
| PFT (ours) | R101c | IM-1K | 16 | 60k | 41.2 (+3.2) | 42.3 (+3.0) | 93M |
| PFT (ours) | Swin-T | IM-1K | 16 | 160k | 42.6 (+0.4) | 42.8 (+0.3) | 74M |
| PFT (ours) | Swin-S | IM-1K | 16 | 160k | 44.8 (+0.7) | 45.3 (+0.3) | 96M |
Results on COCO-Stuff-10K dataset.
We report our results on the COCO-Stuff-10K dataset in Tab. 2. As shown in the table, PFT obtains consistent improvements over MaskFormer by margins of at least 1.3 mIoU with CNN backbones. In particular, PFT with a ResNet-101c backbone achieves 41.2 mIoU, a 3.2 mIoU gain over MaskFormer with the same backbone. With multi-scale inference, our model with a ResNet-101c backbone reaches 42.3 mIoU, a 3.0 mIoU gain over the baseline with the same backbone network.
| method | backbone | pretraining | batch size | schedule | mIoU (s.s.) | mIoU (m.s.) | #params |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PFT (ours) | R50 | IM-1K | 16 | 40k | 53.3 (+0.8) | 54.8 (+0.7) | 60M |
| PFT (ours) | R50c | IM-1K | 16 | 40k | 54.2 (+1.9) | 55.8 (+1.9) | 60M |
| PFT (ours) | R101 | IM-1K | 16 | 40k | 54.6 (+0.9) | 56.2 (+0.8) | 79M |
| PFT (ours) | R101c | IM-1K | 16 | 40k | 55.5 (+2.4) | 57.6 (+2.0) | 79M |
Results on PASCAL-Context dataset.
Finally, we present the results of PFT trained on the PASCAL-Context dataset in Tab. 3. Our method beats MaskFormer with different backbone networks, showing steady improvements for the per-mask classification framework for semantic segmentation. Our largest single-scale performance gain is obtained with the ResNet-101c backbone, which achieves 55.5 mIoU with an additional 16M parameters compared to MaskFormer, outperforming it by a 2.4 mIoU margin in single-scale testing. With multi-scale inference, we obtain 57.6 mIoU with ResNet-101c.
4.3 Ablation studies
To evaluate the effectiveness of the components in our multi-scale transformer decoder for semantic segmentation, we conduct ablations on the multi-scale design and the cross-scale inter-query attention layer. Furthermore, we vary the loss weights for the focal-style cross-entropy loss and show that properly balanced supervision on the probability predictions improves the performance of our framework. For the ablation studies, we train our model with a ResNet-50c backbone on the PASCAL-Context dataset and a Swin-T backbone on the ADE20K dataset. We set the training schedules to 40k and 160k iterations respectively and use a batch size of 16.
Multi-scale prediction for probability-mask pairs.
The predicted probability-mask pairs are averaged over the multiple scales as described in Sec. 3.1. To study whether including more feature scales is indeed beneficial for predicting probabilities and masks, we conduct experiments from two perspectives: varying the usage of multi-scale features for (1) probability prediction and (2) mask prediction. Specifically, the transformer decoder takes the multi-scale features of all four spatial sizes as input, but uses only a subset of the output query embeddings when producing the probabilities or masks during both training and inference.
When ablating the scales for predicting probabilities (or masks), we keep all scales for predicting masks (or probabilities). The number of predicting scales is reduced from two directions for probabilities and masks respectively, i.e., from large to small scales and from small to large ones. As shown in Fig. 3, when using fewer scales to generate probabilities or masks, the performance of our PFT decreases, and the decline is more pronounced for probability prediction in both directions (small-to-large and large-to-small). This suggests that the accuracy of the probabilities matters more to the prediction quality than that of the masks in the probability-mask segmentation framework.
Cross-scale inter-query attention.
In our framework, we propose a novel cross-scale inter-query attention module to allow the multiple scales to propagate and aggregate useful information to other scales. To verify the benefit of this module, we conduct experiments with the cross-scale query attention removed. After this removal, our framework can be viewed as a multi-scale variant of MaskFormer with fixed matching between queries and categories. The results are reported in Fig. 4. On PASCAL-Context, removing the cross-scale attention causes a slight performance decrease. However, when the same model is applied to larger datasets, we observe larger mIoU gaps upon removing this component on COCO-Stuff-10K and ADE20K.
Number of transformer layers.
As reported in [6], a decoder with only one transformer layer can already achieve reasonable performance. We make a similar observation in our ablations: PFT with a single transformer layer produces reasonable results compared to other settings (Fig. 5), while stacking a total of 4 layers achieves the best performance on PASCAL-Context.
Loss weights for the focal-style cross-entropy loss.
To apply strong supervision for more reliable probability predictions, we use a focal-style cross-entropy loss on each resolution's probabilities as described in Sec. 3.2. To verify the efficacy of this supervision, we vary its loss weight, where a weight of zero is equivalent to removing the loss. As shown in Fig. 6, applying this additional loss term is beneficial: a moderate weight gives a performance boost of over 1.0 mIoU compared to not including the loss, while continuing to increase the loss weight does not lead to further gains.
Parameter sharing for different modules.
Our multi-scale transformer adopts a parallel design, where we use different sets of parameter weights for different scales, including linear projections, MLPs, and query positional embeddings. A natural question is whether weight sharing would affect the performance. We therefore conduct experiments using the same weights across all scales for (1) the transformer decoders, (2) the linear projection layers for predicting probability logits, (3) the MLPs for predicting masks, and (4) the query positional embeddings. When sharing the transformer decoder weights across the four scales, we observe a 0.63 mIoU decrease for PFT. Analogously, when we share the linear projection layers for predicting probability logits (row 3), the MLPs for predicting mask logits (row 4), or the query positional embeddings (row 5), the performance slightly decreases. Since the transformer decoders have the most parameters, sharing a decoder across all feature scales affects the performance the most. In our main experiments, we adopt non-weight-sharing decoders, MLPs, and query embeddings.
| transformer decoders | prob. lin. proj. | mask MLPs | query pos. embed. | mIoU |
5 Conclusion
We have presented the Pyramid Fusion Transformer (PFT), which addresses the noisy prediction issue in the per-mask classification semantic segmentation paradigm with multi-scale feature inputs. With an embedding-level fusion module that operates on the latent semantic query embeddings and a prediction-level ensemble module applied directly at the predicted probability-mask level, PFT shows steady improvements over the baseline MaskFormer model on various datasets. We hope that our approach will inspire the community to further research on improving the per-mask classification segmentation framework.
Appendix A Appendix
A.1 Focal-style cross-entropy loss
To enforce strong supervision on the probability logits predicted at each scale, we use a focal-style cross-entropy loss, following an open-source multi-class focal loss implementation, to supervise the per-scale probabilities. Specifically, within each scale, we apply a softmax layer on top of the probability logits to generate probabilities, as in Sec. 3.1, and reweight each cross-entropy term with a focal modulating factor that down-weights easy examples.
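A sketch of such a focal-style cross-entropy (the modulating exponent `gamma = 2` is an assumed default; setting it to 0 recovers the plain cross-entropy):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def focal_cross_entropy(logits, target, gamma=2.0, eps=1e-9):
    """Focal-style cross-entropy on softmax probabilities: the standard
    CE term -log(p_t) is reweighted by (1 - p_t)**gamma so that confident
    (easy) predictions contribute less."""
    p = softmax(logits)                       # (n, K)
    pt = p[np.arange(len(target)), target]    # probability of the true class
    return float((-((1 - pt) ** gamma) * np.log(pt + eps)).mean())

logits = np.array([[4.0, 0.0, 0.0],   # easy, confidently correct
                   [0.5, 0.4, 0.1]])  # hard, barely correct
targets = np.array([0, 0])
focal = focal_cross_entropy(logits, targets)
plain = focal_cross_entropy(logits, targets, gamma=0.0)
# the focal weighting strictly reduces the contribution of the easy sample
```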
A.2 Qualitative comparison of segmentation results
We compare semantic segmentation results from MaskFormer and our PFT on the ADE20K val set. Both models are trained with a Swin-T backbone on ADE20K. As shown in Fig. 7, our method produces better semantic segmentation maps than MaskFormer. For the same regions (red/yellow dashed boxes), PFT produces segmentations with less categorical confusion. Besides, the multi-scale inputs also make it possible to correct wrong predictions for smaller objects (rows 1, 2, and 5, yellow dashed boxes). Equipped with our multi-scale transformer decoder with cross-scale inter-query attention, PFT can generate high-quality segmentations by capturing both global and fine-grained semantic information across the feature pyramid.
A.3 Hyperparameters and experimental settings
In this section, we report our hyperparameters and experimental settings for the three datasets we use. We largely follow MaskFormer in designing our experiments, and the default settings are recorded in Tab. 5. For example, we use the same dice loss and focal loss as MaskFormer, with the same respective loss weights. For the Swin-T and Swin-S backbones trained on ADE20K, we adjust the weight of the focal-style cross-entropy loss applied to the per-resolution predicted probabilities. Besides, we use a smaller backbone learning-rate multiplier for the larger Swin-B and Swin-L backbones to avoid the risk of overfitting. For the smaller PASCAL-Context dataset, we reduce the number of decoder layers for the same reason. From our experiments, we find that this setting works best for PASCAL-Context with both the baseline model (MaskFormer) and our method, possibly due to the smaller number of classes in the dataset.
-  Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers, 2021.
-  Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In CVPR, 2018.
-  Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
-  Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.
-  Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
-  Bowen Cheng, Alexander G Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. NeurIPS, 2021.
-  MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
-  Henghui Ding, Hui Zhang, Jun Liu, Jiaxin Li, Zijian Feng, and Xudong Jiang. Interaction via bi-directional graph of semantic region affinity for scene parsing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15848–15858, October 2021.
-  Xiaoyu Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows, 2021.
-  Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
-  Adeel Hassan. Multi-class focal loss. https://github.com/AdeelH/pytorch-multi-class-focal-loss.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 558–567, 2019.
-  Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
-  Yifan Jiang, Shiyu Chang, and Zhangyang Wang. Transgan: Two pure transformers can make one strong gan, and that can scale up. NeurIPS, 2021.
-  Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
-  Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollar. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
-  Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR, 2019.
-  Xiangtai Li, Ansheng You, Zhen Zhu, Houlong Zhao, Maoke Yang, Kuiyuan Yang, Shaohua Tan, and Yunhai Tong. Semantic flow for fast and accurate scene parsing. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 775–793, Cham, 2020. Springer International Publishing.
-  Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, Tong Lu, and Ping Luo. Panoptic segformer. arXiv preprint arXiv:2109.03814, 2021.
-  Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
-  Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
-  Jianbo Liu, Junjun He, Jiawei Zhang, Jimmy S Ren, and Hongsheng Li. EfficientFCN: Holistically-guided decoding for semantic segmentation. In European Conference on Computer Vision, pages 1–17. Springer, 2020.
-  Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
-  Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. ICCV, 2021.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
-  Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2019.
-  Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient cnn architecture design. In ECCV, 2018.
-  Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, 2016.
-  Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
-  Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4353–4361, 2017.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
-  Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical multi-scale attention for semantic segmentation. CoRR, abs/2005.10821, 2020.
-  Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9627–9636, 2019.
-  Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, volume 139, pages 10347–10357, July 2021.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
-  Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In CVPR, 2021.
-  Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid Vision Transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
-  Yu-Huan Wu, Yun Liu, Xin Zhan, and Ming-Ming Cheng. P2t: Pyramid pooling transformer for scene understanding. ArXiv, abs/2106.12011, 2021.
-  Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.
-  Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, 2020.
-  Yuhui Yuan, Jingyi Xie, Xilin Chen, and Jingdong Wang. Segfix: Model-agnostic boundary refinement for segmentation. In European Conference on Computer Vision, pages 489–506. Springer, 2020.
-  Dong Zhang, Hanwang Zhang, Jinhui Tang, Meng Wang, Xiansheng Hua, and Qianru Sun. Feature pyramid transformer. In European Conference on Computer Vision (ECCV), 2020.
-  Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, and Jianfeng Gao. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2998–3008, October 2021.
-  Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
-  Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
-  Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127(3):302–321, 2019.
-  Fangrui Zhu, Yi Zhu, Li Zhang, Chongruo Wu, Yanwei Fu, and Mu Li. A unified efficient pyramid transformer for semantic segmentation. In ICCV, pages 2667–2677, 2021.
-  Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. In ICLR, 2020.