Pyramid Fusion Transformer for Semantic Segmentation

The recently proposed MaskFormer <cit.> gives a refreshed perspective on the task of semantic segmentation: it shifts from the popular pixel-level classification paradigm to a mask-level classification method. In essence, it generates paired probabilities and masks corresponding to category segments and combines them during inference to form the segmentation maps. The segmentation quality thus relies on how well the queries can capture the semantic information of categories and their spatial locations within the images. In our study, we find that a per-mask classification decoder on top of a single-scale feature is not effective enough to extract reliable probabilities and masks. To mine rich semantic information across the feature pyramid, we propose the Pyramid Fusion Transformer (PFT), a transformer-based decoder for per-mask semantic segmentation on top of multi-scale features. To efficiently utilize image features of different resolutions without incurring too much computational overhead, PFT uses a multi-scale transformer decoder with cross-scale inter-query attention to exchange complementary information. Extensive experimental evaluations and ablations demonstrate the efficacy of our framework. In particular, we achieve a 3.2 mIoU improvement on the COCO-Stuff 10K dataset with ResNet-101c compared to MaskFormer. Moreover, on the ADE20K validation set, our result with a Swin-B backbone matches that of MaskFormer with a much larger Swin-L backbone in both single-scale and multi-scale inference, achieving 54.1 mIoU and 55.3 mIoU respectively. Using a Swin-L backbone, we achieve 56.0 mIoU with single-scale inference and 57.2 mIoU with multi-scale inference on the ADE20K validation set, obtaining state-of-the-art performance on the dataset.




1 Introduction

The goal of semantic segmentation is to assign each pixel of an image a semantic class label. Over the past decade, encoder-decoder based methods have been the mainstream models for this task. They usually use convolution- [26] or transformer-based [37] networks to produce dense predictions from deep features generated by encoder networks. Efforts have been made to design either stronger backbone encoders [25, 39, 10, 12] or decoders [3, 50, 49, 47, 41, 5] for various downstream dense prediction tasks.

(a) Example of incorrect probability with correct mask
(b) Example of high probability for a correct class with inferior mask
Figure 1: Examples of wrong predictions from MaskFormer. Left of (a) & (b): input images. Top-right of (a) & (b): predicted probabilities. Bottom-right of (a) & (b): masks paired with the probabilities.

In general, the above-mentioned semantic segmentation methods follow a per-pixel classification formulation: the outputs of the models are spatial segmentation maps with categorical assignment for each pixel. Recently, Cheng et al. proposed MaskFormer [6] to advocate a pathbreaking replacement for the per-pixel formulation. It adopts a novel per-mask classification paradigm for semantic and panoptic segmentation [18]: instead of directly producing segmentation maps, the objective is to predict a set of paired probabilities and masks corresponding to category segments in an input image. To this end, MaskFormer uses a transformer decoder to obtain probability-mask pairs from a set of learnable object queries.

Despite the state-of-the-art performance of MaskFormer in semantic segmentation, we empirically find it is inclined to produce incorrect probability-mask pairs that degrade the quality of the prediction results. Specifically, we observe two types of prominent mistakes: (1) a query can produce a groundtruth look-alike mask but predict the wrong probability (see Fig. 1(a)), and (2) it can assign a high probability to a class present in the groundtruth but yield a corresponding mask of low quality (see Fig. 1(b)). Since MaskFormer decodes solely on the lowest-resolution feature map (1/32 of the input image size), such unsatisfactory results might be caused by insufficient semantic information in the single-scale feature. Intuitively, predicting object masks requires features containing dense information and rich spatial context, while predicting probabilities needs abstract categorical information. Due to the large variations of semantic information among different feature resolutions [21, 17], it is beneficial to take advantage of a feature pyramid with multi-scale information for generating more accurate probability-mask pairs.

Multi-scale feature maps have been widely studied for many dense vision tasks such as detection and segmentation. [21, 35, 24] showed that multi-scale representations are beneficial for convolution-based object detection. Similarly, multi-scale representations have also been widely explored in convolution-based segmentation frameworks [23, 32]. However, simply extending these multi-scale techniques to transformer-based segmentation frameworks is non-trivial. A naive solution is to perform self-attention on the patch tokens from feature maps of all scales. However, such an approach is too computationally demanding, as the computational cost of self-attention [37] is quadratic in the total number of patch tokens across all scales. Existing works [20, 50] utilize a sparse attention mechanism to reduce the high computational cost, where each query attends to only a sparse set of features for capturing image context. We argue that such sparse attention is sub-optimal for dense pixel prediction tasks, which need to consider the information of all pixels to generate accurate dense predictions.
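To make this cost argument concrete, the following back-of-the-envelope count is our own illustration (not from the paper's code): it assumes a 512×512 input, FPN strides 4/8/16/32, and 150 queries per scale, and compares the number of attention pairs for dense multi-scale self-attention against a query-mediated scheme.

```python
# Pixel-token counts for a 512x512 input at FPN strides 4/8/16/32 (assumed),
# versus routing information through a small set of queries per scale.
H = W = 512
strides = [4, 8, 16, 32]
tokens_per_scale = [(H // s) * (W // s) for s in strides]
total_tokens = sum(tokens_per_scale)           # all multi-scale pixel tokens

# Dense self-attention over every multi-scale token: O(L^2) attention pairs.
dense_pairs = total_tokens ** 2

# Query-mediated scheme: each of the 4 scales has N queries that cross-attend
# to that scale's tokens, plus attention among the 4N concatenated queries.
N = 150                                        # e.g. ADE20K's category count
query_pixel_pairs = sum(N * t for t in tokens_per_scale)
inter_query_pairs = (4 * N) ** 2
sparse_pairs = query_pixel_pairs + inter_query_pairs

# Dense attention needs over two orders of magnitude more pairs here.
ratio = dense_pairs / sparse_pairs
```

The exact ratio depends on the input size and query count, but the quadratic blow-up of dense multi-scale attention is the point.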

We propose to use a multi-scale transformer with cross-scale inter-query attention to effectively aggregate multi-scale information for segmentation. Specifically, we adopt a transformer decoder on the feature pyramid to produce segmentation predictions. As the semantic information at different scales can better capture different objects or structures, it is important to communicate information across the feature scales [21, 17, 50, 23, 34]. We therefore perform cross-scale communication by introducing a cross-scale inter-query attention mechanism, which uses queries as information summarizers and communication bridges. Such a mechanism allows queries of different scales to be informed of semantic information from other scales without directly computing on them. The final prediction is the average of the per-scale predictions in logit space. With the above ingredients, we name our solution the Pyramid Fusion Transformer (PFT), a simple yet effective transformer-based segmentation decoder that reasons over multi-scale feature maps with enhanced latent representations and consolidates predictions with high fidelity.

To demonstrate the effectiveness of our method, we conduct extensive experiments and ablation studies on three widely-used datasets and show state-of-the-art results. As shown in Tab. 1, with a Swin-B backbone, our model achieves 54.1 mIoU on the ADE20K dataset [48], matching the performance of MaskFormer with a much larger Swin-L backbone. When using a Swin-L backbone, our model achieves 56.0 mIoU single-scale performance, which is competitive with MaskFormer's under multi-scale inference. Furthermore, the same model obtains 57.2 multi-scale mIoU on ADE20K, achieving the state-of-the-art result on the dataset.

2 Related Works

Vision Transformer: inspired by the success of the transformer and attention mechanism [37] in NLP tasks [16], the vision community has enjoyed a recent surge of interest in adapting transformer structures to various vision tasks [10, 15, 36, 3, 45]. The pioneering work [10] first proposes to "patchify" an image input into an unordered set of tokens and send them to a series of transformer layers consisting of self-attention modules and fully-connected layers with skip connections. On the task of image classification, it achieves competitive results with popular CNNs such as ResNet [12]. To better adapt the transformer structure to dense vision tasks, researchers have designed various variants with stage-by-stage feature maps of shrinking resolutions [25, 39, 40, 9, 44]. Swin Transformer [25] is arguably the most representative of these pyramid-structured transformers. Instead of performing self-attention on the entire set of patch tokens, it uses window partitions to constrain where the attention mechanism is applied and allows communication among windows via a shifted configuration. As a result, it achieves high performance on several downstream vision tasks with reduced computational cost and memory consumption.

Per-mask classification segmentation: MaskFormer [6] and Max-DeepLab [38] are among the pioneering works to use a mask classification paradigm in place of the end-to-end per-pixel classification segmentation approach. Contrary to Fully Convolutional Networks (FCNs) that predict segmentation maps with per-pixel labels, they produce paired results for masks and their corresponding class labels. In particular, Max-DeepLab adds a transformer module to each backbone convolutional block and performs self-/cross-attention to allow communication between the backbone and the decode head. MaskFormer outputs probability logits for classes and yields mask embeddings for mask generation with a transformer on top of the lowest-resolution features. Recently, Li et al. [20] propose Panoptic Segformer, which shares a similar per-mask classification concept for semantic and panoptic segmentation. Specifically, it derives the predicted masks from the attention weights between location-aware queries and spatial features and predicts probabilities in a separate branch. During inference, the masks and probabilities are combined to generate the segmentation maps.

3 Method

Our overall pipeline is shown in Fig. 2. Our proposed Pyramid Fusion Transformer (PFT) takes a feature pyramid encoded by a backbone network as input and adopts a novel multi-scale transformer decoder with a cross-scale inter-query attention mechanism to efficiently fuse multi-scale information for accurate semantic segmentation.

The backbone network, which can be either a convolutional or a transformer network, receives an input image of size $H \times W \times 3$ and produces a hierarchy of feature maps through a Feature Pyramid Network (FPN) with a uniform channel dimension $C$. We flatten them into sequences of pixel tokens $X_i$ of length $L_i = H_i \times W_i$ for scales $i = 1, \dots, 4$, where $(H_i, W_i)$ denotes the spatial size of the $i$-th feature map. Our PFT is applied on top of these flattened sequences obtained from the multi-scale features. Each scale has a separate set of $N$ category queries to estimate the confidences and locations of the $N$ semantic categories, where each query is only responsible for capturing the semantic information of one assigned category. Within the transformer, we recurrently stack three types of attention layers: (1) an intra-scale query self-attention layer that conducts conventional self-attention between queries within the same scale, (2) a novel cross-scale inter-query attention layer that efficiently communicates scale-aware information using the limited number of queries of the different scales, and (3) an intra-scale query-pixel cross-attention layer that aggregates semantic information from the flattened sequences of pixel tokens.
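As a minimal sketch of this three-step layer (simplified single-head attention without heads, FFNs, or normalization; random stand-in tensors; the `pft_layer` helper and shapes are our assumptions, not the authors' code):

```python
import numpy as np

def attention(q, k, v):
    """Single-head scaled dot-product attention (heads/FFN omitted)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def pft_layer(queries, pixel_tokens):
    """One PFT decoder layer over 4 scales, following the three steps above."""
    # (1) intra-scale query self-attention, separately per scale
    queries = [q + attention(q, q, q) for q in queries]
    # (2) cross-scale inter-query attention on the concatenated queries
    cat = np.concatenate(queries, axis=0)
    cat = cat + attention(cat, cat, cat)
    queries = np.split(cat, len(queries), axis=0)   # slice back per scale
    # (3) intra-scale query-pixel cross-attention against each scale's tokens
    return [q + attention(q, x, x) for q, x in zip(queries, pixel_tokens)]

rng = np.random.default_rng(0)
N, C = 150, 32                       # queries per scale, channel dimension
queries = [rng.normal(size=(N, C)) for _ in range(4)]
pixels = [rng.normal(size=(L, C)) for L in (4096, 1024, 256, 64)]
out = pft_layer(queries, pixels)     # four updated (N, C) query sets
```

Stacking this layer several times mirrors the recurrent structure described above.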

Figure 2: Overview of Pyramid Fusion Transformer (PFT). PFT is composed of a backbone network with FPN to extract a hierarchy of latent representations and a set of parallel transformer decoders with cross-scale inter-query attention to process the features. We use $N$ queries within each scale, where $N$ equals the number of categories in the dataset.

3.1 Pyramid Fusion Transformer with Cross-scale Inter-query Attention

Multi-scale information is important for accurate scene understanding. Low-resolution feature maps capture global context, while high-resolution ones are better at discovering fine structures such as category boundaries [46, 17, 43, 31]. It is therefore vital to propagate information across the multiple scales to capture both global and fine-grained information. However, due to the high computational cost of directly applying attention to the large number of multi-scale pixel tokens, previous transformer-based semantic segmentation methods such as [49, 20] often rely on sparse attention over the pixel tokens to model pixel-to-pixel relations. Contrary to their approaches, to avoid heavy computation we propose to fuse the multi-scale information with our cross-scale inter-query attention mechanism. Instead of extracting global and local semantic information among the pixel tokens like the previous sparse attention-based methods, we fuse the multi-scale information in the query embedding space. Three types of attention layers are recurrently stacked in our PFT to achieve this goal.

Intra-scale query self-attention.

Within each scale, the intra-scale query self-attention layer conducts self-attention only between the category queries of that scale. Specifically, the category queries $Q_i \in \mathbb{R}^{N \times C}$ of scale $i$, together with learnable positional encodings $P_i$, are input into the layer. The category queries are zero-initialized at the first layer and updated by the stacked attention layers, while the learnable positional encodings are shared across depths. Following [3, 50, 6], the positional encodings are only used for encoding the query and key embeddings for self-attention. The self-attention is conducted only between queries of the same scale to obtain the updated queries $\hat{Q}_i$. Such an intra-scale self-attention layer consists of the commonly used multi-head self-attention sub-layer and a Feed-Forward Network (FFN) sub-layer, with layer normalization and residual connections.

Cross-scale inter-query attention.

As the above intra-scale self-attention is limited to each scale, the updated queries can only obtain information specific to their own scale. To allow information propagation between the multiple scales for knowledge fusion, a novel cross-scale inter-query attention layer is introduced. Since the number of scale-aware queries in each scale is much smaller than the number of visual tokens per scale, we achieve information propagation across the multiple scales by conducting attention on the concatenated category queries of the four scales. The inter-query attention results are then sliced back into the queries of the four scales, which serve as input for the follow-up intra-scale query-pixel cross-attention layer.

To distinguish queries of different scales, we reuse the learnable positional encodings $P_i$ from the intra-scale query self-attention layer. The attention outputs a sequence of category queries of length $4N$, which is sliced into sub-sequences of length $N$ and assigned back to each scale $i$ for $i = 1, \dots, 4$. In this way, cross-scale attention and information communication are efficiently achieved using the small number of category queries. The proposed cross-scale inter-query attention layer also consists of a multi-head attention sub-layer and an FFN sub-layer with layer normalization and residual connections, following the classical design of dot-product attention.
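The concatenate-attend-slice pattern can be sketched as follows (random stand-ins for the queries $Q_i$ and scale-distinguishing positional encodings $P_i$; single-head attention for brevity, so this is an illustration rather than the actual multi-head layer):

```python
import numpy as np

rng = np.random.default_rng(1)
N, C, S = 150, 32, 4                  # queries per scale, channels, scales
Q = [rng.normal(size=(N, C)) for _ in range(S)]
P = [rng.normal(size=(N, C)) for _ in range(S)]   # distinguish the scales

# Positional encodings are added only to the attention embeddings (q, k),
# not to the values, mirroring the DETR-style convention referenced above.
cat_qk = np.concatenate([q + p for q, p in zip(Q, P)], axis=0)   # (4N, C)
cat_v = np.concatenate(Q, axis=0)                                # (4N, C)

scores = cat_qk @ cat_qk.T / np.sqrt(C)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
updated = cat_v + w @ cat_v                       # residual connection

# Slice the 4N updated queries back into per-scale groups of length N.
per_scale = np.split(updated, S, axis=0)
```

The attention is computed on only $4N$ tokens, which is why this step stays cheap compared to attending over pixel tokens.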

Intra-scale query-pixel cross-attention.

To aggregate dense pixel-level semantic information into the category queries, we conduct dense query-pixel cross-attention within each scale: the category queries perform multi-head cross-attention with the pixel tokens $X_i$ of each scale $i$. The learnable positional encodings $P_i$ are added to the category queries, while fixed sinusoidal positional encodings are added to the pixel tokens $X_i$, since learning positional encodings for the large number of pixel tokens would be costly.
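For reference, fixed sinusoidal positional encodings in the style of [37] can be generated as below. This is a 1D sketch over flattened tokens; the actual implementation may use a 2D variant, so treat the helper as illustrative.

```python
import numpy as np

def sinusoidal_pe(length, dim):
    """Fixed sinusoidal positional encodings (Transformer-style, 1D)."""
    pos = np.arange(length)[:, None]             # token positions
    i = np.arange(dim // 2)[None, :]             # frequency indices
    angles = pos / np.power(10000.0, 2 * i / dim)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angles)                 # even channels: sine
    pe[:, 1::2] = np.cos(angles)                 # odd channels: cosine
    return pe

pe = sinusoidal_pe(1024, 32)   # e.g. a 32x32 feature map flattened to 1024 tokens
```

Because the encodings are deterministic, no parameters need to be learned for the pixel tokens, regardless of their number.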

As illustrated by Fig. 2, in our pyramid fusion transformer, each transformer layer consists of the above three types of attention layers and is repeated $T$ times to form a $T$-layer transformer. For intra-scale query self-attention and intra-scale query-pixel cross-attention, we use separate weights for the linear projection layers, layer normalization, etc. for each scale. The proposed cross-scale inter-query attention layers are placed between the intra-scale query self-attention layers and the intra-scale query-pixel cross-attention layers. The small number of category queries serve as bridges to efficiently aggregate and propagate the pixel-level semantic information across the multiple scales. Neither intra-scale nor cross-scale pixel-to-pixel attention is used in our PFT, avoiding the heavy computational cost of dense multi-scale information fusion.

Generating Segmentation Maps.

After the multiple transformer layers, the updated four sets of category queries with multi-scale information can be used to generate the probability-mask pairs for segmentation.

To generate per-category probabilities for each scale, we use separate linear projections to map each scale's category queries at the output of our PFT to scalar binary probability logits, yielding an $N$-dimensional logit vector per scale. The probability logits of the multiple scales are averaged and passed through a sigmoid function to generate the binary probabilities $p \in [0, 1]^N$, which denote the confidence of each category being present in the input image.


To generate the binary category masks for the multiple scales, we first transform the FPN spatial feature with four separate sets of convolutions to produce the per-scale spatial features $F_i$ for $i = 1, \dots, 4$. Each set of category queries for scale $i$ then goes through a Multi-Layer Perceptron (MLP) and takes a matrix product with its corresponding spatial feature $F_i$ to produce the mask logits. The mask logits of the four scales are averaged and passed through a sigmoid function to obtain the mask probability maps $m$ of each category, at 1/4 resolution of the input image.

For each category $c$, its probability $p_c$ is the $c$-th entry of $p$, and its mask $m_c$ is the $c$-th channel of $m$. The semantic segmentation map is obtained by probability-mask marginalization as in MaskFormer [6], where the category prediction for pixel $(h, w)$ is computed as $\arg\max_{c} \, p_c \cdot m_c(h, w)$.
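The whole inference path, from averaging logits across scales through probability-mask marginalization, can be sketched with random stand-in tensors (the shapes, 1/4-resolution size, and variable names are our assumptions for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
S, N, C, H4, W4 = 4, 150, 32, 64, 64   # scales, categories, channels, 1/4 res

# Stand-ins for the PFT decoder outputs at each scale.
prob_logits = [rng.normal(size=(N,)) for _ in range(S)]        # one per query
queries = [rng.normal(size=(N, C)) for _ in range(S)]          # MLP outputs
spatial = [rng.normal(size=(C, H4 * W4)) for _ in range(S)]    # features F_i

# Probabilities: average the per-scale logits, then sigmoid.
p = sigmoid(np.mean(prob_logits, axis=0))                      # (N,)

# Masks: per-scale query-feature matrix product, averaged, then sigmoid.
mask_logits = np.mean([q @ f for q, f in zip(queries, spatial)], axis=0)
m = sigmoid(mask_logits).reshape(N, H4, W4)

# Probability-mask marginalization: per-pixel argmax of p_c * m_c.
seg_map = np.argmax(p[:, None, None] * m, axis=0)              # (H4, W4)
```

The resulting map would then be upsampled to the input resolution for evaluation.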


3.2 Training losses

To train our model, we adopt a classification loss and a mask loss to optimize the probabilities and masks. To construct the groundtruth for per-mask classification semantic segmentation, we decompose the segmentation map of each image into a set of groundtruth label-mask pairs, where each label is the categorical label for the binary mask covering all pixels belonging to that category. The number of categories present in an image is usually smaller than $N$, so we pad the groundtruth set with "not-exist" ($\varnothing$) pairs. During training, the probability-mask pair produced by the $i$-th query is matched to the groundtruth label-mask pair of category $i$, or to $\varnothing$ if category $i$ is absent from the image.

The classification loss consists of two parts: a binary cross-entropy loss $\mathcal{L}_{\text{bce}}$ applied to the probability logits averaged over the four decoding scales, and a focal-style [22] binary cross-entropy loss $\mathcal{L}_{\text{focal-ce}}$ applied to each scale's decoder output. If category $i$ is present in the input image, the $i$-th query's probability is expected to be high; otherwise, we maximize the probability of predicting $\varnothing$, which represents the absence of the category. A similar formulation is adopted for the focal-style classification loss at each scale, where we adaptively reweight hard samples following [22]; its detailed formulation can be found in Sec. A.1. The classification loss is then a linear combination of the two, $\mathcal{L}_{\text{cls}} = \lambda_{\text{bce}} \mathcal{L}_{\text{bce}} + \lambda_{\text{focal-ce}} \mathcal{L}_{\text{focal-ce}}$, where $\lambda_{\text{bce}}$ and $\lambda_{\text{focal-ce}}$ are hyperparameters balancing the two terms.

To construct our mask loss $\mathcal{L}_{\text{mask}}$, we use the same binary focal loss [22] and dice loss [29] as in MaskFormer [6]: $\mathcal{L}_{\text{mask}} = \lambda_{\text{focal}} \mathcal{L}_{\text{focal}} + \lambda_{\text{dice}} \mathcal{L}_{\text{dice}}$. Note that the mask loss is only applied to the averaged mask logits from the four scales and optimizes masks with groundtruth categories only; masks corresponding to $\varnothing$ are simply discarded during training.

Our final training loss is the sum of the classification loss and the mask loss: $\mathcal{L} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{mask}}$.
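A sketch of the individual loss terms under their standard formulations (the focal form follows [22]; the balancing weights are omitted, and for brevity the per-scale focal classification loss is shown on the averaged logits only, so this is not the paper's exact training code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(logits, targets, eps=1e-8):
    """Binary cross-entropy on probability logits."""
    p = sigmoid(logits)
    return -np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))

def focal_bce(logits, targets, alpha=0.25, gamma=2.0, eps=1e-8):
    """Focal-style binary cross-entropy in its standard form [22]."""
    p = sigmoid(logits)
    pt = np.where(targets == 1, p, 1 - p)
    a = np.where(targets == 1, alpha, 1 - alpha)
    return -np.mean(a * (1 - pt) ** gamma * np.log(pt + eps))

def dice_loss(mask_logits, targets, eps=1.0):
    """Dice loss on the sigmoid mask probabilities [29]."""
    m = sigmoid(mask_logits)
    inter = 2 * np.sum(m * targets)
    return 1 - (inter + eps) / (np.sum(m) + np.sum(targets) + eps)

rng = np.random.default_rng(3)
N, HW = 150, 4096
avg_cls_logits = rng.normal(size=(N,))
cls_target = (rng.random(N) < 0.1).astype(float)    # categories present
mask_logits = rng.normal(size=(N, HW))
mask_target = (rng.random((N, HW)) < 0.5).astype(float)

loss = (bce(avg_cls_logits, cls_target)             # L_bce on averaged logits
        + focal_bce(avg_cls_logits, cls_target)     # focal-style cls loss
        + focal_bce(mask_logits, mask_target)       # binary focal mask loss
        + dice_loss(mask_logits, mask_target))      # dice mask loss
```

In the actual objective each term carries its own weight, and the mask terms are computed only on the categories present in the groundtruth.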


4 Experiments

In this section, we demonstrate the effectiveness of our method with competitive semantic segmentation results, comparing to both state-of-the-art per-pixel classification and mask-level classification frameworks on three popular segmentation datasets: ADE20K [48], COCO-Stuff-10K [2], and PASCAL-Context [30]. We choose MaskFormer [6] as our baseline model because of its strong performance among the mask-level classification methods [6, 38, 20]. In the ablations, we further study the effectiveness of our proposed components, including the usage of multi-scale features, cross-scale inter-query attention, loss designs, and model parameter sharing. Experimental results demonstrate that our model learns useful information from multi-scale feature maps to deliver high-quality segmentation maps with our proposed per-scale transformer decoders and cross-scale inter-query attention module.

4.1 Datasets and Implementation details


ADE20K [48] is a semantic segmentation dataset with 150 fine-grained semantic categories, including both thing and stuff classes. It contains 20,210 images for training, 2,000 images for validation, and 3,352 images for testing. COCO-Stuff-10K [2] is a scene parsing dataset with 171 categories, not counting the class "unlabeled". We follow the official split, partitioning the dataset into 9k images for training and 1k images for validation. PASCAL-Context [30] contains pixel-level annotations for whole scenes, with 4,998 images for training and 5,105 images for testing. We evaluate our method on the commonly used 59 classes of the dataset.

Implementation details.

We use the open-source segmentation codebase mmsegmentation [7] to implement PFT. We adopt Swin Transformer [25] and ResNet [12] as backbone networks for evaluation. For ResNets, we report results obtained with ResNet-50 and ResNet-101, along with the slightly modified variant ResNet-101c, whose stem convolution layer is replaced by three consecutive 3×3 convolutions, a protocol widely adopted in semantic segmentation methods [5, 28, 46, 4, 14, 13].

Training settings.

Models are trained on ADE20K, COCO-Stuff-10K, and PASCAL-Context with 160k-, 60k-, and 40k-iteration schedules respectively, unless otherwise specified. For the ADE20K dataset, images are cropped after scale jittering, random horizontal flipping, and color jittering. The same data augmentations are used for the COCO-Stuff-10K and PASCAL-Context datasets, with dataset-specific crop sizes. Scale jittering is set to between 0.5 and 2.0 of the crop sizes.

Following [6], we use the same weights for the focal loss and dice loss as MaskFormer, with dataset-specific classification loss weights for ADE20K, COCO-Stuff-10K, and PASCAL-Context. We ablate the focal-style cross-entropy loss weight as in Fig. 6 and choose the same value for all datasets. For the focal-style cross-entropy loss's hyperparameters, we use the default $\alpha = 0.25$ and $\gamma = 2$ [22]. AdamW [27] is used as our optimizer with a linear learning rate decay schedule. ResNet backbones are pretrained on ImageNet-1K, and a learning rate multiplier is applied to the CNN backbones; the Swin Transformer backbones use their own initial learning rate and weight decay. For Swin-T and Swin-S backbones, we use the official weights pretrained on ImageNet-1K [33]; for Swin-B and Swin-L, we use weights pretrained on ImageNet-22K at a higher input resolution. All models are trained on a single compute node with 8 NVIDIA Tesla V100 GPUs. See Sec. A.3 for a more detailed documentation of hyperparameters and experimental settings.

Evaluation settings.

We use mean Intersection-over-Union (mIoU) as the evaluation metric for semantic segmentation performance. Both single-scale and multi-scale inference results are reported in our experiments. For multi-scale inference, we apply horizontal flipping and scales of 0.5, 0.75, 1.0, 1.25, 1.5, and 1.75.
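Multi-scale inference of this kind can be sketched as follows. This is our own illustration with nearest-neighbor resizing and a toy predictor so the sketch stays dependency-free; real pipelines use bilinear interpolation of the score maps.

```python
import numpy as np

def multi_scale_inference(image, predict, scales, flip=True):
    """Average per-class score maps over rescaled (and flipped) inputs.
    `predict` maps an (H, W, 3) image to an (N, H', W') score map."""
    H, W, _ = image.shape
    acc = None
    for s in scales:
        h, w = int(H * s), int(W * s)
        ys, xs = np.arange(h) * H // h, np.arange(w) * W // w
        scaled = image[ys][:, xs]                    # nearest-neighbor resize
        views = [scaled, scaled[:, ::-1]] if flip else [scaled]
        for k, v in enumerate(views):
            out = predict(v)
            if k == 1:
                out = out[:, :, ::-1]                # un-flip the prediction
            yb, xb = np.arange(H) * h // H, np.arange(W) * w // W
            out = out[:, yb][:, :, xb]               # resize back to (N, H, W)
            acc = out if acc is None else acc + out
    return np.argmax(acc, axis=0)                    # per-pixel class labels

# Toy check with a constant predictor that always favors class 2.
dummy = lambda img: np.stack([np.full(img.shape[:2], c, float) for c in range(3)])
seg = multi_scale_inference(np.zeros((8, 8, 3)), dummy, [0.5, 1.0])
```

Averaging scores over scales and flips before the argmax is what yields the "m.s." numbers in the tables.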

4.2 Main results

backbone type method backbone pretraining batch size schedule mIoU (s.s.) mIoU (m.s.) #params.
CNN OCRNet [42] R101c IM-1K 16 150k - 45.3 -
GRAr [8] R101c IM-1K 16 200k - 47.1 -
DeepLabV3+ [5] R50c IM-1K 16 160k 44.0 44.9 44M
R101c IM-1K 16 160k 45.5 46.4 63M
MaskFormer [6] R50 IM-1K 16 160k 44.5 46.7 41M
R101 IM-1K 16 160k 45.5 47.2 60M
R101c IM-1K 16 160k 46.0 48.1 60M
PFT (ours) R50 IM-1K 16 160k 45.6 (+1.1) 48.3 (+1.6) 74M
R101 IM-1K 16 160k 47.2 (+1.7) 49.4 (+2.2) 93M
R101c IM-1K 16 160k 47.9 (+1.9) 49.4 (+1.3) 93M
Transformer BEiT [1] ViT-L IM-22K 16 160k 56.7 57.0 441M
SETR [47] ViT-L IM-22K 16 160k 48.6 50.3 308M
MaskFormer [6] Swin-T IM-1K 16 160k 46.7 48.8 42M
Swin-S IM-1K 16 160k 49.8 51.0 63M
Swin-B IM-22K 16 160k 52.7 53.9 102M
Swin-L IM-22K 16 160k 54.1 55.6 212M
PFT (ours) Swin-T IM-1K 16 160k 48.3 (+1.6) 49.6 (+0.8) 74M
Swin-S IM-1K 16 160k 51.0 (+1.2) 52.2 (+1.2) 96M
Swin-B IM-22K 16 160k 54.1 (+1.4) 55.3 (+1.4) 133M
Swin-L IM-22K 16 160k 56.0 (+1.9) 57.2 (+1.6) 242M
Table 1: Experiments on the ADE20K dataset. Results reported on the ADE20K validation set. s.s.: single-scale inference. m.s.: multi-scale inference. Improvements over the baseline model (MaskFormer) are reported in brackets.

Results on ADE20K dataset.

Tab. 1 summarizes our results on the ADE20K validation set. We report results from both single-scale and multi-scale inference. As shown in the table, when paired with the same ResNet-50 backbone, PFT achieves 45.6 mIoU, improving over MaskFormer [6] by 1.1 mIoU and matching the accuracy obtained by [6] with a ResNet-101 backbone. With our largest CNN backbone, ResNet-101c, and single-scale input, we achieve 47.9 mIoU, a 1.9 mIoU improvement over MaskFormer with the same encoder. The results with transformer backbones are consistent with those with CNNs, surpassing MaskFormer with the same backbones and pretraining. Notably, with a Swin-B backbone, we obtain an mIoU of 54.1, matching the result of MaskFormer with a much larger Swin-L backbone. Our best model achieves 56.0 single-scale mIoU and 57.2 multi-scale mIoU, surpassing BEiT [1] despite its more sophisticated pretraining scheme and achieving the state-of-the-art result on the dataset.

backbone type method backbone pretraining batch size schedule mIoU (s.s.) mIoU (m.s.) #params.
CNN OCRNet [42] R101c IM-1K 16 160k - 39.5 -
GRAr [8] R101c IM-1K 16 100k - 41.9 -
MaskFormer [6] R50 IM-1K 16 160k 37.1 38.9 44M
R50c IM-1K 32 160k 37.7 38.1 44M
R101 IM-1K 32 160k 39.1 39.8 63M
R101c IM-1K 32 160k 38.0 39.3 63M
PFT (ours) R50 IM-1K 16 160k 38.4 (+1.3) 40.3 (+1.4) 74M
R50c IM-1K 16 160k 39.5 (+1.8) 41.0 (+2.9) 74M
R101 IM-1K 16 160k 40.9 (+1.8) 42.1 (+2.3) 93M
R101c IM-1K 16 160k 41.2 (+3.2) 42.3 (+3.0) 93M
Transformer MaskFormer [6] Swin-T IM-1K 16 160k 42.2 42.5 42M
Swin-S IM-1K 16 160k 44.1 45.0 63M
PFT (ours) Swin-T IM-1K 16 160k 42.6 (+0.4) 42.8 (+0.3) 74M
Swin-S IM-1K 16 160k 44.8 (+0.7) 45.3 (+0.3) 96M
Table 2: Experiments on the COCO-Stuff-10K dataset. Results reported on the val set. s.s.: single-scale inference. m.s.: multi-scale inference. Some MaskFormer results were produced by our re-implementation. Improvements over the baseline model (MaskFormer) are reported in brackets.

Results on COCO-Stuff-10K dataset.

We report our results on the COCO-Stuff-10K dataset in Tab. 2. As shown in the table, PFT obtains consistent improvements over MaskFormer by a margin of at least 1.3 mIoU with CNN backbones. In particular, PFT with a ResNet-101c backbone achieves 41.2 mIoU, a performance gain of 3.2 mIoU over MaskFormer. With multi-scale inference, our model with a ResNet-101c backbone reaches 42.3 mIoU, a 3.0 mIoU gain over the baseline with the same backbone network.

backbone type method backbone pretraining batch size schedule mIoU (s.s.) mIoU (m.s.) #params.
CNN SFNet [19] R50c IM-1K 16 138k - 50.7 -
R101c IM-1K 16 138k - 53.8 -
GRAr [8] R101c IM-1K 16 150k - 55.7 -
MaskFormer [6] R50 IM-1K 16 140k 52.5 54.1 44M
R50c IM-1K 16 140k 52.3 53.9 44M
R101 IM-1K 16 140k 53.7 55.4 63M
R101c IM-1K 16 140k 53.1 55.6 63M
PFT (ours) R50 IM-1K 16 140k 53.3 (+0.8) 54.8 (+0.7) 60M
R50c IM-1K 16 140k 54.2 (+1.9) 55.8 (+1.9) 60M
R101 IM-1K 16 140k 54.6 (+0.9) 56.2 (+0.8) 79M
R101c IM-1K 16 140k 55.5 (+2.4) 57.6 (+2.0) 79M
Table 3: Experiments on the PASCAL-Context dataset. Results reported on the PASCAL-Context validation set with 59 categories. s.s.: single-scale inference. m.s.: multi-scale inference. Some MaskFormer results were produced by our re-implementation. Improvements over the baseline model (MaskFormer) are reported in brackets.

Results on PASCAL-Context dataset.

Finally, we present results from PFT trained on the PASCAL-Context dataset in Tab. 3. Our method beats MaskFormer with all tested backbone networks, showing steady improvements for the per-mask classification framework for semantic segmentation. Our largest single-scale performance gain is obtained with the ResNet-101c backbone, which achieves 55.5 mIoU with an additional 16M parameters compared to MaskFormer, outperforming it by a 2.4 mIoU margin in single-scale testing. With multi-scale inference, we obtain 57.6 mIoU with ResNet-101c.

Figure 3: Ablation on multi-scale prediction. Top: ablating from large to small scales. Bottom: ablating from small to large scales. When the number of predicting scales for probability is reduced, mask predicting still uses all four scales and vice versa.

4.3 Ablation studies

To evaluate the effectiveness of the components in our multi-scale transformer decoder for semantic segmentation, we conduct ablations on the multi-scale design and the cross-scale inter-query attention layer. Furthermore, we vary the loss weight of the focal-style cross-entropy loss and show that properly balanced supervision on the probability predictions improves the performance of our framework. For the ablation studies, we train our model with a ResNet-50c backbone on the PASCAL-Context dataset and a Swin-T backbone on the ADE20K dataset, with 40k- and 160k-iteration training schedules respectively and a batch size of 16.

Multi-scale prediction for probability-mask pairs.

The predicted probability-mask pairs are averaged over the multiple scales as described in Sec. 3.1. To study whether including more feature scales is indeed beneficial for predicting probabilities and masks, we conduct experiments from two perspectives: varying the usage of multi-scale features for (1) probability prediction and (2) mask prediction. Specifically, the transformer decoder still takes the multi-scale features of all four scales as input, but uses only a subset of the output query embeddings when producing the probabilities or masks, during both training and inference.

When ablating the scales for predicting probabilities or masks, we keep all scales for predicting masks or probabilities, respectively. The number of predicting scales is reduced from two directions for probabilities and masks, i.e., from large to small scales and from small to large ones. As shown in Fig. 3, when fewer scales are used to generate probabilities or masks, the performance of our PFT decreases, and the decline is more pronounced for probability prediction in both directions (small-to-large and large-to-small). It might be that the accuracy of the probabilities matters more to the prediction quality than that of the masks in the probability-mask segmentation framework.
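As a concrete illustration of how the averaged probability-mask pairs produce a segmentation map, the following sketch averages hypothetical per-scale class and mask logits and combines them in the MaskFormer style. The function name, tensor shapes, and the simple mean-averaging are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def fuse_and_segment(cls_logits, mask_logits):
    """Illustrative prediction-level fusion (hypothetical shapes).

    cls_logits:  (S, Q, C+1) per-scale class logits for Q queries,
                 C categories plus a "no object" slot.
    mask_logits: (S, Q, H, W) per-scale mask logits.
    """
    # Average predictions across the S scales (a sketch of the
    # multi-scale averaging, not the exact implementation).
    cls_avg = cls_logits.mean(axis=0)    # (Q, C+1)
    mask_avg = mask_logits.mean(axis=0)  # (Q, H, W)

    # Softmax over classes (drop the "no object" slot), sigmoid on masks.
    e = np.exp(cls_avg - cls_avg.max(axis=-1, keepdims=True))
    probs = (e / e.sum(axis=-1, keepdims=True))[:, :-1]  # (Q, C)
    masks = 1.0 / (1.0 + np.exp(-mask_avg))              # (Q, H, W)

    # MaskFormer-style combination: per-pixel class scores are the
    # mask-weighted sum of query probabilities.
    seg = np.einsum("qc,qhw->chw", probs, masks)         # (C, H, W)
    return seg.argmax(axis=0)                            # (H, W) label map

S, Q, C, H, W = 4, 100, 150, 8, 8
rng = np.random.default_rng(0)
label_map = fuse_and_segment(rng.normal(size=(S, Q, C + 1)),
                             rng.normal(size=(S, Q, H, W)))
```

Ablating a scale in this sketch simply means averaging over a subset along the first axis, which mirrors the probability-side and mask-side ablations above.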

Cross-resolution inter-query attention.

In our framework, we propose a novel cross-scale inter-query attention module that allows each scale to propagate and aggregate useful information to and from the other scales. To verify the benefit brought by this module, we conduct experiments with the cross-resolution query attention removed. After such removal, our framework can be viewed as a multi-scale variant of MaskFormer [6] with fixed matching between queries and categories. The results are reported in Fig. 4. On PASCAL-Context, removing the cross-resolution attention causes only a slight decrease in performance ( mIoU). However, when the same model is applied to larger datasets, we observe larger gaps in mIoU when removing this component on COCO-Stuff-10K ( mIoU) and ADE20K ( mIoU).

Figure 4: Performances w/ and w/o cross-scale inter-query attention. Results obtained with ResNet-50c backbone on three datasets with various sizes and complexities.
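The core idea of cross-scale inter-query attention can be sketched as plain dot-product attention over the pooled queries of all scales, so that each query can attend to queries from every other scale. The helper below and its projection weights are hypothetical; the actual module may differ in detail (e.g., multiple heads, residual connections, normalization).

```python
import numpy as np

def cross_scale_inter_query_attention(queries, w_q, w_k, w_v):
    """Sketch of attention across the queries of all scales.

    queries: (S, Q, D) query embeddings from S scales.
    w_q, w_k, w_v: (D, D) stand-in projection matrices.
    """
    S, Q, D = queries.shape
    tokens = queries.reshape(S * Q, D)  # pool queries of every scale together
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Scaled dot-product attention over all S*Q queries, letting each
    # query aggregate complementary information from the other scales.
    scores = q @ k.T / np.sqrt(D)
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ v
    return out.reshape(S, Q, D)  # split back into per-scale queries

rng = np.random.default_rng(0)
S, Q, D = 4, 100, 32
x = rng.normal(size=(S, Q, D))
w = [rng.normal(size=(D, D)) * 0.1 for _ in range(3)]
y = cross_scale_inter_query_attention(x, *w)
```

Removing this module, as in the ablation above, would leave each scale's queries isolated, with no path for cross-scale information exchange before the predictions are ensembled.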

Number of transformer layers.

As reported in [6], a decoder with only one transformer layer can already achieve reasonable performance. We make a similar observation in our ablations: PFT with a single transformer layer produces reasonable results compared to other settings (Fig. 5), while stacking a total of 4 layers achieves the best performance on PASCAL-Context.

Loss weights for the focal-style cross-entropy loss.

To apply strong supervision for more reliable probability predictions, we use a focal-style cross-entropy loss on each resolution's probabilities as described in Sec. 3.2. To verify the efficacy of this supervision, we train with different loss weights, where a weight of zero is equivalent to removing the loss. As shown in Fig. 6, applying this additional loss term is beneficial: an appropriate weight gives a performance boost of over 1.0 mIoU compared to not including the loss. Continuing to increase the loss weight does not lead to further gains in performance.

Figure 5: Ablations on number of transformer layers within each per-scale transformer decoder.
Figure 6: Loss weights for the focal-style cross-entropy loss. We vary the loss weight and train our model under the same setting on the PASCAL-Context dataset, and pick our candidate parameters given the results.

Parameter sharing for different modules.

Our multi-scale transformer adopts a parallel design, where we use different sets of parameter weights for different scales, including linear projections, MLPs, and query positional embeddings. A natural question to ask is whether weight sharing would affect the performance. We therefore conduct experiments on using the same weights for (1) transformer decoders, (2) linear projection layers for predicting probability logits, (3) MLPs for predicting masks, and (4) query positional embeddings of all scales. When sharing the decoder weights across the four scales, we observe a 0.63 mIoU decrease for PFT. Analogously, when we share the linear projection layers for predicting probability logits (row 3), the MLPs for predicting mask logits (row 4), or the query positional embeddings (row 5), the performance slightly decreases. Since the transformer decoders have the most parameters, using a shared decoder across all feature scales affects the performance the most. In our main experiments, we adopt non-weight-sharing decoders, MLPs, and query embeddings.

transformer decoders prob. lin. proj. mask MLPs query pos. embed. mIoU
weight sharing 54.06
Table 4: Ablations on weight sharing for different modules including (1) transformer decoders: per-scale transformer decoders, (2) prob. lin. proj.: linear projection layers to produce probability logits, (3) mask MLPs: MLPs on query embeddings before mask logits, and (4) query pos. embed.: query positional embeddings across all scales.
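A toy parameter count makes the trade-off above concrete: the parallel design multiplies each module's parameters by the number of scales, so sharing matters most for the largest modules. All sizes and the `linear_params` helper below are hypothetical, chosen only to illustrate the scaling.

```python
# Toy comparison of per-scale vs shared parameters (hypothetical sizes):
# S scales, embedding dimension D, C classes.
S, D, C = 4, 256, 150

def linear_params(d_in, d_out):
    return d_in * d_out + d_out  # weight matrix + bias vector

# Per-scale prediction heads: a linear projection for probability logits
# and a small MLP (here, 3 layers) for mask embeddings at every scale.
prob_head = linear_params(D, C + 1)
mask_mlp = 3 * linear_params(D, D)

per_scale_total = S * (prob_head + mask_mlp)  # parallel (non-shared) design
shared_total = prob_head + mask_mlp           # one set shared across scales
```

The parallel design costs S times the parameters of the shared one; for the much larger transformer decoders the absolute gap is correspondingly bigger, consistent with sharing the decoders hurting performance the most.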

5 Conclusion

We have presented the Pyramid Fusion Transformer (PFT), which addresses the noisy prediction issue in the per-mask classification semantic segmentation paradigm with multi-scale feature inputs. With an embedding-level fusion module that operates on the latent semantic query embeddings and a prediction-level ensemble module that is applied directly to the predicted probability-mask pairs, PFT shows steady improvements over the baseline MaskFormer model on various datasets. We hope that our approach will inspire the community to further research on improving the per-mask classification segmentation framework.

Appendix A Appendix

A.1 Focal-style cross-entropy loss

To enforce strong supervision on the probability logits predicted at each scale, we use a focal-style [22] cross-entropy loss as implemented in [11] to supervise the probabilities predicted at each scale. Specifically, within each scale, we apply a softmax layer on top of the probability logits to generate probabilities as in Sec. 3.1. For the probability $p_i$ from the $i$-th position of the prediction, we use the loss $\mathcal{L}_{\text{focal-ce}}(p_i)$, where

$$\mathcal{L}_{\text{focal-ce}}(p_i) = -\alpha\,(1 - p_i)^{\gamma}\,\log(p_i).$$

Here, $\alpha$ and $\gamma$ are the hyperparameters used in focal loss [22]. Following [22], we use the same values of $\alpha$ and $\gamma$ for all our experiments.
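A minimal numpy sketch of such a focal-style cross-entropy follows, assuming softmax probabilities and the default $\alpha$, $\gamma$ of [22]. The function name and shapes are illustrative, not the implementation from [11].

```python
import numpy as np

def focal_cross_entropy(logits, target, alpha=0.25, gamma=2.0):
    """Focal-style cross-entropy on probability logits (a sketch).

    logits: (N, C) probability logits; target: (N,) class indices.
    alpha/gamma default to the focal loss paper's [22] values, which
    may differ from the exact choice used here.
    """
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)  # softmax
    pt = p[np.arange(len(target)), target]                 # prob. of true class
    # Down-weight well-classified examples by (1 - pt)^gamma.
    return float(np.mean(-alpha * (1.0 - pt) ** gamma * np.log(pt)))

logits = np.array([[4.0, 0.0, 0.0],   # confident and correct
                   [0.0, 0.0, 0.0]])  # uncertain
target = np.array([0, 0])
loss = focal_cross_entropy(logits, target)
```

With gamma set to 0 and alpha to 1, the expression reduces to the plain cross-entropy; the focal term only scales each example's loss down as its true-class probability grows.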

A.2 Qualitative comparison of segmentation results

We compare semantic segmentation results from MaskFormer [6] and our PFT on the ADE20K [48] val set. Both models are trained with a Swin-T backbone on ADE20K. As shown in Fig. 7, our method produces better semantic segmentation maps than MaskFormer. For the same regions (red/yellow dashed boxes), PFT produces segmentations with less categorical confusion. Besides, multi-scale inputs also make it possible to correct wrong predictions for smaller objects (rows 1, 2, and 5, yellow dashed boxes). Equipped with our multi-scale transformer decoder with cross-scale inter-query attention, PFT can generate high-quality segmentations by capturing both global and fine-grained semantic information across the feature pyramid.

(a) Input image
(b) Ground truth
(c) MaskFormer [6]
(d) PFT (ours)
Figure 7: Examples of semantic segmentation results from ADE20K [48] val set.

A.3 Hyperparameters and experimental settings

In this section, we report our hyperparameters and experimental settings for the three datasets we use. We largely follow [6] in designing our experiments, and the default settings are recorded in Tab. 5. For example, we use the same dice loss [29] and focal loss [22] as in [6], with the same loss weights. For Swin-T and Swin-S backbones trained on ADE20K, we adjust the weight of the focal-style cross-entropy loss applied to the per-resolution predicted probabilities. Besides, we use a smaller backbone learning rate multiplier for the larger Swin-B and Swin-L backbones to avoid the risk of overfitting. For the smaller PASCAL-Context dataset, we reduce the number of decoder layers for the same reason. From our experiments, we find that a different weight works best for PASCAL-Context with both the baseline model (MaskFormer) and our method, possibly due to the smaller number of classes in the dataset.

ResNet Swin
learning rate
weight decay
# of decoder layers
focal-like cross-entropy loss
, ,
 [6, 29]
 [6, 22]
backbone lr multiplier
(a) Training settings for ADE20K
ResNet Swin
learning rate
weight decay
# of decoder layers
focal-like cross-entropy loss
, ,
backbone lr multiplier
(b) Training settings for COCO-Stuff-10K
ResNet Swin
learning rate -
weight decay -
# of decoder layers -
focal-like cross-entropy loss -
backbone lr multiplier -
(c) Training settings for PASCAL-Context
Table 5: Training settings for ADE20K [48], COCO-Stuff-10K [2], and PASCAL-Context [30]. We mostly follow the setups for hyperparameters reported in [6].


  • [1] Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers, 2021.
  • [2] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In CVPR, 2018.
  • [3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • [4] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.
  • [5] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
  • [6] Bowen Cheng, Alexander G Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. NeurIPS, 2021.
  • [7] MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark., 2020.
  • [8] Henghui Ding, Hui Zhang, Jun Liu, Jiaxin Li, Zijian Feng, and Xudong Jiang. Interaction via bi-directional graph of semantic region affinity for scene parsing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15848–15858, October 2021.
  • [9] Xiaoyu Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows, 2021.
  • [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • [11] Adeel Hassan. Multi-class focal loss.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [13] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 558–567, 2019.
  • [14] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
  • [15] Yifan Jiang, Shiyu Chang, and Zhangyang Wang. Transgan: Two pure transformers can make one strong gan, and that can scale up. NeurIPS, 2021.
  • [16] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
  • [17] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollar. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [18] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In CVPR, 2019.
  • [19] Xiangtai Li, Ansheng You, Zhen Zhu, Houlong Zhao, Maoke Yang, Kuiyuan Yang, Shaohua Tan, and Yunhai Tong. Semantic flow for fast and accurate scene parsing. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 775–793, Cham, 2020. Springer International Publishing.
  • [20] Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, Tong Lu, and Ping Luo. Panoptic segformer. arXiv preprint arXiv:2109.03814, 2021.
  • [21] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • [22] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
  • [23] Jianbo Liu, Junjun He, Jiawei Zhang, Jimmy S Ren, and Hongsheng Li. EfficientFCN: Holistically-guided decoding for semantic segmentation. In European Conference on Computer Vision, pages 1–17. Springer, 2020.
  • [24] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  • [25] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. ICCV, 2021.
  • [26] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ICLR, 2019.
  • [28] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient cnn architecture design. In ECCV, 2018.
  • [29] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, 2016.
  • [30] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
  • [31] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4353–4361, 2017.
  • [32] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [33] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • [34] Andrew Tao, Karan Sapra, and Bryan Catanzaro. Hierarchical multi-scale attention for semantic segmentation. CoRR, abs/2005.10821, 2020.
  • [35] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9627–9636, 2019.
  • [36] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, volume 139, pages 10347–10357, July 2021.
  • [37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • [38] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In CVPR, 2021.
  • [39] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid Vision Transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
  • [40] Yu-Huan Wu, Yun Liu, Xin Zhan, and Ming-Ming Cheng. P2t: Pyramid pooling transformer for scene understanding. ArXiv, abs/2106.12011, 2021.
  • [41] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, 2021.
  • [42] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, 2020.
  • [43] Yuhui Yuan, Jingyi Xie, Xilin Chen, and Jingdong Wang. Segfix: Model-agnostic boundary refinement for segmentation. In European Conference on Computer Vision, pages 489–506. Springer, 2020.
  • [44] Dong Zhang, Hanwang Zhang, Jinhui Tang, Meng Wang, Xiansheng Hua, and Qianru Sun. Feature pyramid transformer. In European Conference on Computer Vision (ECCV), 2020.
  • [45] Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, and Jianfeng Gao. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2998–3008, October 2021.
  • [46] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
  • [47] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.
  • [48] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision, 127(3):302–321, 2019.
  • [49] Fangrui Zhu, Yi Zhu, Li Zhang, Chongruo Wu, Yanwei Fu, and Mu Li. A unified efficient pyramid transformer for semantic segmentation. In ICCV, pages 2667–2677, 2021.
  • [50] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. In ICLR, 2020.