Per-Pixel Classification is Not All You Need for Semantic Segmentation

07/13/2021 ∙ by Bowen Cheng, et al.

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.


1 Introduction

The goal of semantic segmentation is to partition an image into regions with different semantic categories. Starting from the Fully Convolutional Networks (FCNs) work of Long et al. long2015fully, most deep learning-based semantic segmentation approaches formulate semantic segmentation as per-pixel classification (Figure 1 left), applying a classification loss to each output pixel deeplabV3plus; zhao2017pspnet. Per-pixel predictions in this formulation naturally partition an image into regions of different classes.

Mask classification is an alternative paradigm that disentangles the image partitioning and classification aspects of segmentation. Instead of classifying each pixel, mask classification-based methods predict a set of binary masks, each associated with a single class prediction (Figure 1 right). The more flexible mask classification dominates the field of instance-level segmentation. Both Mask R-CNN he2017mask and DETR detr yield a single class prediction per segment for instance and panoptic segmentation. In contrast, per-pixel classification assumes a static number of outputs and cannot return a variable number of predicted regions/segments, which is required for instance-level tasks.

Figure 1: Per-pixel classification vs. mask classification. (left) Semantic segmentation with per-pixel classification applies the same classification loss to each location. (right) Mask classification predicts a set of binary masks and assigns a single class to each mask. Each prediction is supervised with a per-pixel binary mask loss and a classification loss. Matching between the set of predictions and ground truth segments can be done either via bipartite matching similarly to DETR detr or by fixed matching via direct indexing if the number of predictions and classes match, i.e., if $N = K$.

Our key observation: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks. In fact, before FCN long2015fully, the best performing semantic segmentation methods like O2P carreira2012semantic and SDS hariharan2014simultaneous used a mask classification formulation. Given this perspective, a natural question emerges: can a single mask classification model simplify the landscape of effective approaches to semantic- and instance-level segmentation tasks? And can such a mask classification model outperform existing per-pixel classification methods for semantic segmentation?

To address both questions we propose a simple MaskFormer approach that seamlessly converts any existing per-pixel classification model into a mask classification model. Using the set prediction mechanism proposed in DETR detr, MaskFormer employs a Transformer decoder vaswani2017attention to compute a set of pairs, each consisting of a class prediction and a mask embedding vector. The mask embedding vector is used to get the binary mask prediction via a dot product with the per-pixel embedding obtained from an underlying fully-convolutional network. The new model solves both semantic- and instance-level segmentation tasks in a unified manner: no changes to the model, losses, or training procedure are required. Specifically, for semantic and panoptic segmentation tasks alike, MaskFormer is supervised with the same per-pixel binary mask loss and a single classification loss per mask. Finally, we design a simple inference strategy to blend MaskFormer outputs into a task-dependent prediction format.

We evaluate MaskFormer on five semantic segmentation datasets with various numbers of categories: Cityscapes Cordts2016Cityscapes (19 classes), Mapillary Vistas neuhold2017mapillary (65 classes), ADE20K zhou2017ade20k (150 classes), COCO-Stuff-10K caesar2016coco (171 classes), and ADE20K-Full zhou2017ade20k (847 classes). While MaskFormer performs on par with per-pixel classification models for Cityscapes, which has a few diverse classes, the new model demonstrates superior performance for datasets with a larger vocabulary. We hypothesize that a single class prediction per mask models fine-grained recognition better than per-pixel class predictions. MaskFormer achieves the new state-of-the-art on ADE20K (55.6 mIoU) with a Swin-Transformer liu2021swin backbone, outperforming a per-pixel classification model liu2021swin with the same backbone by 2.1 mIoU, while being more efficient (10% reduction in parameters and 40% reduction in FLOPs).

Finally, we study MaskFormer’s ability to solve instance-level tasks using two panoptic segmentation datasets: COCO lin2014coco; kirillov2017panoptic and ADE20K zhou2017ade20k. MaskFormer outperforms a more complex DETR model detr with the same backbone and the same post-processing. Moreover, MaskFormer achieves the new state-of-the-art on COCO (52.7 PQ), outperforming prior state-of-the-art wang2021max by 1.6 PQ. Our experiments highlight MaskFormer’s ability to unify instance- and semantic-level segmentation.

2 Related Works

Both per-pixel classification and mask classification have been extensively studied for semantic segmentation. In early work, Konishi and Yuille Konishi apply per-pixel Bayesian classifiers based on local image statistics. Then, inspired by early works on non-semantic groupings Comaniciu1997; ShiPAMI2000, mask classification-based methods became popular, demonstrating the best performance in PASCAL VOC challenges everingham2015pascal. Methods like O2P carreira2012semantic and CFM dai2015convolutional have achieved state-of-the-art results by classifying mask proposals carreira2011cpmc; uijlings2013selective; arbelaez2014multiscale. In 2015, FCN long2015fully extended the idea of per-pixel classification to deep nets, significantly outperforming all prior methods on mIoU (a per-pixel evaluation metric which particularly suits the per-pixel classification formulation of segmentation).

Per-pixel classification became the dominant approach for deep-net-based semantic segmentation since the seminal work of Fully Convolutional Networks (FCNs) long2015fully. Modern semantic segmentation models focus on aggregating long-range context in the final feature map: ASPP deeplabV2; deeplabV3 uses atrous convolutions with different atrous rates; PPM zhao2017pspnet uses pooling operators with different kernel sizes; DANet fu2019dual, OCNet yuan2018ocnet, and CCNet huang2019ccnet use different variants of non-local blocks wang2018non. Recently, SETR zheng2021rethinking and Segmenter strudel2021segmenter replace traditional convolutional backbones with Vision Transformers (ViT) dosovitskiy2020vit that capture long-range context starting from the very first layer. However, these concurrent Transformer-based vaswani2017attention semantic segmentation approaches still use a per-pixel classification formulation. Note that our MaskFormer module can convert any per-pixel classification model to the mask classification setting, allowing seamless adoption of advances in per-pixel classification.

Mask classification is commonly used for instance-level segmentation tasks hariharan2014simultaneous; kirillov2017panoptic. These tasks require a dynamic number of predictions, making application of per-pixel classification challenging as it assumes a static number of outputs. Omnipresent Mask R-CNN he2017mask uses a global classifier to classify mask proposals for instance segmentation. DETR detr further incorporates a Transformer vaswani2017attention design to handle thing and stuff segmentation simultaneously for panoptic segmentation kirillov2017panoptic. However, these mask classification methods require predictions of bounding boxes, which may limit their usage in semantic segmentation. The recently proposed Max-DeepLab wang2021max removes the dependence on box predictions for panoptic segmentation with conditional convolutions tian2020conditional; wang2020solov2. However, in addition to the main mask classification losses it requires multiple auxiliary losses (i.e., instance discrimination loss, mask-ID cross entropy loss, and the standard per-pixel classification loss).

3 From Per-Pixel to Mask Classification

In this section, we first describe how semantic segmentation can be formulated as either a per-pixel classification or a mask classification problem. Then, we introduce our instantiation of the mask classification model with the help of a Transformer decoder vaswani2017attention. Finally, we describe simple inference strategies to transform mask classification outputs into task-dependent prediction formats.

3.1 Per-pixel classification formulation

For per-pixel classification, a segmentation model aims to predict the probability distribution over all possible $K$ categories for every pixel of an $H \times W$ image: $y = \{p_i \,|\, p_i \in \Delta^K\}_{i=1}^{H \cdot W}$. Here $\Delta^K$ is the $K$-dimensional probability simplex. Training a per-pixel classification model is straightforward: given ground truth category labels $y^{\mathrm{gt}} = \{y^{\mathrm{gt}}_i \,|\, y^{\mathrm{gt}}_i \in \{1, \dots, K\}\}_{i=1}^{H \cdot W}$ for every pixel, a per-pixel cross-entropy (negative log-likelihood) loss is usually applied, i.e., $\mathcal{L}_{\text{pixel-cls}}(y, y^{\mathrm{gt}}) = \sum_{i=1}^{H \cdot W} -\log p_i(y^{\mathrm{gt}}_i)$.
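
For reference, this per-pixel loss is a standard cross-entropy over pixels; the following is a minimal PyTorch sketch (the shapes and names are assumptions for illustration, not the authors' code):

```python
import torch
import torch.nn.functional as F

# logits: (B, K, H, W) per-pixel class scores; target: (B, H, W) class indices in [0, K).
def per_pixel_cls_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Cross-entropy averaged over all pixels, matching the negative log-likelihood form above.
    return F.cross_entropy(logits, target)
```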

3.2 Mask classification formulation

Mask classification splits the segmentation task into 1) partitioning/grouping the image into $N$ regions ($N$ does not need to equal $K$), represented with binary masks $\{m_i \,|\, m_i \in [0,1]^{H \times W}\}_{i=1}^{N}$; and 2) associating each region as a whole with some distribution over $K$ categories. To jointly group and classify a segment, i.e., to perform mask classification, we define the desired output $z$ as a set of $N$ probability-mask pairs, i.e., $z = \{(p_i, m_i)\}_{i=1}^{N}$. In contrast to per-pixel class probability prediction, for mask classification the probability distribution $p_i \in \Delta^{K+1}$ contains an auxiliary "no object" label ($\varnothing$) in addition to the $K$ category labels. The $\varnothing$ label is predicted for masks that do not correspond to any of the $K$ categories. Note, mask classification allows multiple mask predictions with the same associated class, making it applicable to both semantic- and instance-level segmentation tasks.

To train a mask classification model, a matching $\sigma$ between the set of predictions $z$ and the set of $N^{\mathrm{gt}}$ ground truth segments $z^{\mathrm{gt}} = \{(c_i^{\mathrm{gt}}, m_i^{\mathrm{gt}}) \,|\, c_i^{\mathrm{gt}} \in \{1, \dots, K\},\, m_i^{\mathrm{gt}} \in \{0,1\}^{H \times W}\}_{i=1}^{N^{\mathrm{gt}}}$ is required. Different mask classification methods utilize various matching rules: for instance, Mask R-CNN he2017mask uses a heuristic procedure based on anchor boxes, while DETR detr optimizes a bipartite matching between $z$ and $z^{\mathrm{gt}}$. Here $c_i^{\mathrm{gt}}$ is the ground truth class of the $i$-th ground truth segment. Since the sizes of the prediction set $|z| = N$ and the ground truth set $|z^{\mathrm{gt}}| = N^{\mathrm{gt}}$ generally differ, we assume $N \geq N^{\mathrm{gt}}$ and pad the set of ground truth labels with "no object" tokens $\varnothing$ to allow one-to-one matching.

For semantic segmentation, a trivial fixed matching is possible if the number of predictions $N$ matches the number of category labels $K$. In this case, the $i$-th prediction is matched to a ground truth region with class label $i$ and to $\varnothing$ if a region with class label $i$ is not present in the ground truth. In our experiments, we found that a bipartite matching-based assignment demonstrates better results than the fixed matching. Unlike DETR detr that uses bounding boxes to compute the assignment costs between predictions and ground truth for the matching problem, we directly use class and mask predictions, i.e., $-p_i(c_j^{\mathrm{gt}}) + \mathcal{L}_{\text{mask}}(m_i, m_j^{\mathrm{gt}})$, where $\mathcal{L}_{\text{mask}}$ is a binary mask loss.
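
The bipartite matching step can be sketched as follows with SciPy's Hungarian solver; this is a simplified illustration (the names, shapes, and the soft-dice stand-in for the full mask cost are assumptions, not the reference implementation):

```python
import torch
from scipy.optimize import linear_sum_assignment

# pred_probs: (N, K+1) softmax class probabilities, pred_masks: (N, H, W) soft masks,
# gt_classes: (N_gt,) class indices, gt_masks: (N_gt, H, W) binary masks.
def hungarian_match(pred_probs, pred_masks, gt_classes, gt_masks):
    # Classification cost: negative probability of the ground truth class.
    cost_class = -pred_probs[:, gt_classes]                           # (N, N_gt)
    # Mask cost: a simple soft-dice cost stands in for the full binary mask loss.
    p = pred_masks.flatten(1)                                          # (N, H*W)
    g = gt_masks.flatten(1).float()                                    # (N_gt, H*W)
    inter = p @ g.T                                                    # (N, N_gt)
    cost_mask = 1 - (2 * inter + 1) / (p.sum(-1, keepdim=True) + g.sum(-1) + 1)
    cost = (cost_class + cost_mask).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)                     # minimizes total cost
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)
```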

To train model parameters, given a matching $\sigma$, the main mask classification loss $\mathcal{L}_{\text{mask-cls}}$ is composed of a cross-entropy classification loss and a binary mask loss $\mathcal{L}_{\text{mask}}$ for each predicted segment:

$$\mathcal{L}_{\text{mask-cls}}(z, z^{\mathrm{gt}}) = \sum_{j=1}^{N} \Big[ -\log p_{\sigma(j)}(c_j^{\mathrm{gt}}) + \mathbb{1}_{c_j^{\mathrm{gt}} \neq \varnothing}\, \mathcal{L}_{\text{mask}}(m_{\sigma(j)}, m_j^{\mathrm{gt}}) \Big]. \qquad (1)$$

Note, that most existing mask classification models use auxiliary losses (e.g., a bounding box loss he2017mask; detr or an instance discrimination loss wang2021max) in addition to $\mathcal{L}_{\text{mask-cls}}$. In the next section we present a simple mask classification model that allows end-to-end training with $\mathcal{L}_{\text{mask-cls}}$ alone.
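
Given a matching, the loss of Eq. (1) can be sketched as below (a simplified illustration with assumed shapes and a generic mask_loss_fn; not the authors' code):

```python
import torch
import torch.nn.functional as F

# pred_logits: (N, K+1) class logits, pred_masks: (N, H, W) soft masks,
# (pred_idx, gt_idx): matched index pairs from a matcher such as the sketch above.
def mask_cls_loss(pred_logits, pred_masks, gt_classes, gt_masks, pred_idx, gt_idx,
                  mask_loss_fn, no_object_idx):
    N = pred_logits.shape[0]
    # Every prediction gets a classification target: matched predictions take the GT class,
    # unmatched ones take the "no object" label.
    target_classes = torch.full((N,), no_object_idx, dtype=torch.long,
                                device=pred_logits.device)
    target_classes[pred_idx] = gt_classes[gt_idx]
    loss_cls = F.cross_entropy(pred_logits, target_classes)
    # The binary mask loss is only applied to predictions matched to a real segment.
    loss_mask = mask_loss_fn(pred_masks[pred_idx], gt_masks[gt_idx].float())
    return loss_cls + loss_mask
```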

3.3 MaskFormer

Figure 2: MaskFormer overview. We use a backbone to extract image features $\mathcal{F}$. A pixel decoder gradually upsamples image features to extract per-pixel embeddings $\mathcal{E}_{\text{pixel}}$. A transformer decoder attends to image features and produces $N$ per-segment embeddings $\mathcal{Q}$. The embeddings independently generate $N$ class predictions with $N$ corresponding mask embeddings $\mathcal{E}_{\text{mask}}$. Then, the model predicts $N$ possibly overlapping binary mask predictions via a dot product between pixel embeddings $\mathcal{E}_{\text{pixel}}$ and mask embeddings $\mathcal{E}_{\text{mask}}$ followed by a sigmoid activation. For the semantic segmentation task we can get the final prediction by combining the $N$ binary masks with their class predictions using a simple matrix multiplication (see Section 3.4). Note, the dimensions for multiplication are shown in gray.

We now introduce MaskFormer, the new mask classification model, which computes $N$ probability-mask pairs $z = \{(p_i, m_i)\}_{i=1}^{N}$. The model contains three modules (see Fig. 2): 1) a pixel-level module that extracts per-pixel embeddings used to generate binary mask predictions; 2) a transformer module, where a stack of Transformer decoder layers vaswani2017attention computes $N$ per-segment embeddings; and 3) a segmentation module, which generates predictions $\{(p_i, m_i)\}_{i=1}^{N}$ from these embeddings. During inference, discussed in Sec. 3.4, $p_i$ and $m_i$ are assembled into the final prediction.

Pixel-level module takes an image of size $H \times W$ as input. A backbone generates a (typically) low-resolution image feature map $\mathcal{F} \in \mathbb{R}^{C_\mathcal{F} \times \frac{H}{S} \times \frac{W}{S}}$, where $C_\mathcal{F}$ is the number of channels and $S$ is the stride of the feature map ($C_\mathcal{F}$ depends on the specific backbone and we use $S = 32$ in this work). Then, a pixel decoder gradually upsamples the features to generate per-pixel embeddings $\mathcal{E}_{\text{pixel}} \in \mathbb{R}^{C_\mathcal{E} \times H \times W}$, where $C_\mathcal{E}$ is the embedding dimension. Note, that any per-pixel classification-based segmentation model fits the pixel-level module design, including recent Transformer-based models strudel2021segmenter; zheng2021rethinking; liu2021swin. MaskFormer seamlessly converts such a model to mask classification.

Transformer module uses the standard Transformer decoder vaswani2017attention to compute, from image features $\mathcal{F}$ and $N$ learnable positional embeddings (i.e., queries), its output: $N$ per-segment embeddings $\mathcal{Q} \in \mathbb{R}^{C_\mathcal{Q} \times N}$ of dimension $C_\mathcal{Q}$ that encode global information about each segment MaskFormer predicts. Similarly to DETR detr, the decoder yields all predictions in parallel.

Segmentation module applies a linear classifier, followed by a softmax activation, on top of the per-segment embeddings $\mathcal{Q}$ to yield class probability predictions $\{p_i \in \Delta^{K+1}\}_{i=1}^{N}$ for each segment. Note, that the classifier predicts an additional "no object" category ($\varnothing$) in case the embedding does not correspond to any region. For mask prediction, a Multi-Layer Perceptron (MLP) with 2 hidden layers converts the per-segment embeddings $\mathcal{Q}$ to $N$ mask embeddings $\mathcal{E}_{\text{mask}} \in \mathbb{R}^{C_\mathcal{E} \times N}$ of dimension $C_\mathcal{E}$. Finally, we obtain each binary mask prediction $m_i \in [0,1]^{H \times W}$ via a dot product between the $i$-th mask embedding and the per-pixel embeddings $\mathcal{E}_{\text{pixel}}$ computed by the pixel-level module, followed by a sigmoid activation, i.e., $m_i[h, w] = \operatorname{sigmoid}\!\big(\mathcal{E}_{\text{mask}}[:, i]^{\mathsf{T}} \cdot \mathcal{E}_{\text{pixel}}[:, h, w]\big)$.
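
A minimal sketch of this dot-product mask prediction (shapes are assumptions for illustration):

```python
import torch

# mask_embed: (N, C_e) per-segment mask embeddings, pixel_embed: (C_e, H, W) per-pixel embeddings.
def predict_masks(mask_embed: torch.Tensor, pixel_embed: torch.Tensor) -> torch.Tensor:
    # Dot product over the embedding dimension gives one H x W logit map per segment,
    # and a sigmoid turns it into a (possibly overlapping) soft binary mask.
    mask_logits = torch.einsum("nc,chw->nhw", mask_embed, pixel_embed)
    return mask_logits.sigmoid()
```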

Note, we empirically find it is beneficial to not enforce mask predictions to be mutually exclusive to each other by using a softmax activation. During training, the $\mathcal{L}_{\text{mask-cls}}$ loss combines a cross entropy classification loss and a binary mask loss $\mathcal{L}_{\text{mask}}$ for each predicted segment. For simplicity we use the same $\mathcal{L}_{\text{mask}}$ as DETR detr, i.e., a linear combination of a focal loss lin2017focal and a dice loss milletari2016v multiplied by hyper-parameters $\lambda_{\text{focal}}$ and $\lambda_{\text{dice}}$ respectively.

3.4 Mask-classification inference

First, we present a simple general inference procedure that converts mask classification outputs to either panoptic or semantic segmentation output formats. Then, we describe a semantic inference procedure specifically designed for semantic segmentation. We note, that the specific choice of inference strategy largely depends on the evaluation metric rather than the task.

General inference partitions an image into segments by assigning each pixel $[h, w]$ to one of the $N$ predicted probability-mask pairs via $\arg\max_{i:\, c_i \neq \varnothing} p_i(c_i) \cdot m_i[h, w]$. Here $c_i$ is the most likely class label $c_i = \arg\max_{c \in \{1, \dots, K, \varnothing\}} p_i(c)$ for each probability-mask pair $i$. Intuitively, this procedure assigns a pixel at location $[h, w]$ to probability-mask pair $i$ only if both the most likely class probability $p_i(c_i)$ and the mask prediction probability $m_i[h, w]$ are high. Pixels assigned to the same probability-mask pair $i$ form a segment where each pixel is labelled with $c_i$. For semantic segmentation, segments sharing the same category label are merged; whereas for instance-level segmentation tasks, the index $i$ of the probability-mask pair helps to distinguish different instances of the same class. Finally, to reduce false positive rates in panoptic segmentation we follow previous inference strategies detr; kirillov2017panoptic. Specifically, we filter out low-confidence predictions prior to inference and remove predicted segments that have large parts of their binary masks ($m_i > 0.5$) occluded by other predictions.
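
A sketch of this general inference procedure is given below (the shapes, the confidence threshold, and the simplified handling of the "no object" label are assumptions for illustration, not the reference implementation):

```python
# pred_probs: (N, K+1) softmax scores with the last column as "no object",
# pred_masks: (N, H, W) sigmoid masks. Returns a (H, W) map of winning query indices
# and per-query class labels, from which semantic or panoptic output can be assembled.
def general_inference(pred_probs, pred_masks, conf_thresh=0.8):
    scores, labels = pred_probs[:, :-1].max(dim=-1)   # most likely real class per query
    keep = scores > conf_thresh                        # drop low-confidence queries
    # (a fuller version would also drop queries whose argmax over all K+1 classes is "no object")
    scores, labels, masks = scores[keep], labels[keep], pred_masks[keep]
    pixel_scores = scores[:, None, None] * masks       # p_i(c_i) * m_i[h, w]
    assignment = pixel_scores.argmax(dim=0)            # (H, W): winning query per pixel
    return assignment, labels                          # labels[assignment] gives per-pixel classes
```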

Semantic inference is designed specifically for semantic segmentation and is done via a simple matrix multiplication. We empirically find that marginalization over probability-mask pairs, i.e., $\arg\max_{c \in \{1, \dots, K\}} \sum_{i=1}^{N} p_i(c) \cdot m_i[h, w]$, yields better results than the hard assignment of each pixel to a probability-mask pair used in the general inference strategy. The argmax does not include the "no object" category ($\varnothing$) as standard semantic segmentation requires each output pixel to take a label. Note, this strategy returns a per-pixel class probability $\sum_{i=1}^{N} p_i(c) \cdot m_i[h, w]$. However, we observe that directly maximizing per-pixel class likelihood leads to poor performance. We hypothesize, that gradients are evenly distributed to every query, which complicates training.
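
A sketch of semantic inference as a matrix multiplication (assumed shapes; not the reference implementation):

```python
import torch

# pred_probs: (N, K+1) softmax scores, pred_masks: (N, H, W) sigmoid masks.
def semantic_inference(pred_probs, pred_masks):
    # Drop the "no object" column, then marginalize over queries:
    # semseg[c, h, w] = sum_i p_i(c) * m_i[h, w].
    class_probs = pred_probs[:, :-1]                              # (N, K)
    semseg = torch.einsum("nk,nhw->khw", class_probs, pred_masks)
    return semseg.argmax(dim=0)                                   # (H, W) per-pixel class labels
```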

4 Experiments

We demonstrate that MaskFormer seamlessly unifies semantic- and instance-level segmentation tasks by showing state-of-the-art results on both semantic segmentation and panoptic segmentation datasets. Then, we ablate the MaskFormer design confirming that observed improvements in semantic segmentation indeed stem from the shift from per-pixel classification to mask classification.

Datasets. We study MaskFormer using four widely used semantic segmentation datasets: ADE20K zhou2017ade20k (150 classes) from the SceneParse150 challenge ade20k_sceneparse_150, COCO-Stuff-10K caesar2016coco (171 classes), Cityscapes Cordts2016Cityscapes (19 classes), and Mapillary Vistas neuhold2017mapillary (65 classes). In addition, we use the ADE20K-Full zhou2017ade20k dataset annotated in an open vocabulary setting (we keep the 847 classes that are present in both train and validation sets). For panoptic segmentation evaluation we use COCO lin2014coco; caesar2016coco; kirillov2017panoptic (80 "things" and 53 "stuff" categories) and ADE20K-Panoptic zhou2017ade20k; kirillov2017panoptic (100 "things" and 50 "stuff" categories). Please see the appendix for detailed descriptions of all used datasets.

Evaluation metrics. For semantic segmentation the standard metric is mIoU (mean Intersection-over-Union) everingham2015pascal, a per-pixel metric that directly corresponds to the per-pixel classification formulation. To better illustrate the difference between segmentation approaches, in our ablations we supplement mIoU with PQ (PQ stuff) kirillov2017panoptic, a per-region metric that treats all classes as “stuff” and evaluates each segment equally, irrespective of its size. We report the median of 3 runs for all datasets, except for Cityscapes where we report the median of 5 runs. For panoptic segmentation, we use the standard PQ (panoptic quality) metric kirillov2017panoptic and report single run results due to prohibitive training costs.

Baseline models. On the right we sketch the per-pixel classification baselines used. The PerPixelBaseline uses the pixel-level module of MaskFormer and directly outputs per-pixel class scores. For a fair comparison, we design PerPixelBaseline+ which adds the transformer module and mask embedding MLP to the PerPixelBaseline. Thus, PerPixelBaseline+ and MaskFormer differ only in the formulation: per-pixel vs. mask classification. Note that these baselines are for ablation, and we compare MaskFormer with state-of-the-art per-pixel classification models as well.

4.1 Implementation details

Backbone. MaskFormer is compatible with any backbone architecture. In our work we use the standard convolution-based ResNet he2016deep backbones (R50 and R101 with 50 and 101 layers respectively) and the recently proposed Transformer-based Swin-Transformer liu2021swin backbones. In addition, we use the R101c model deeplabV2 which replaces the first 7×7 convolution layer of R101 with 3 consecutive 3×3 convolutions and which is popular in the semantic segmentation community zhao2017pspnet; deeplabV3; deeplabV3plus; huang2019ccnet; yuan2020object; cheng2020panoptic.

Pixel decoder. The pixel decoder in Figure 2 can be implemented using any semantic segmentation decoder (e.g., deeplabV3plus; cheng2019spgnet; cheng2020panoptic). Many per-pixel classification methods use modules like ASPP deeplabV2 or PSP zhao2017pspnet to collect and distribute context across locations. The Transformer module attends to all image features, collecting global information to generate class predictions. This setup reduces the per-pixel module's need for heavy context aggregation. Therefore, for MaskFormer, we design a light-weight pixel decoder based on the popular FPN lin2016feature architecture.

Following FPN, we upsample the low-resolution feature map in the decoder and sum it with the projected feature map of corresponding resolution from the backbone; projection is done with a 1×1 convolution layer followed by GroupNorm (GN) wu2018group to match the channel dimensions of the feature maps. Next, we fuse the summed features with an additional 3×3 convolution layer followed by GN and a ReLU activation. We repeat this process starting from the stride-32 feature map until we obtain a final feature map of stride 4. Finally, we apply a single convolution layer to get the per-pixel embeddings. All feature maps in the pixel decoder have a dimension of 256 channels.
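
A light-weight pixel decoder of this kind might be sketched as follows (the channel sizes, norm groups, and the kernel size of the final projection are assumptions, not the exact MaskFormer implementation):

```python
import torch.nn as nn
import torch.nn.functional as F

class PixelDecoder(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048), dim=256, embed_dim=256):
        super().__init__()
        # 1x1 lateral projections (with GN) match backbone channels to the decoder dimension.
        self.lateral = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, dim, 1), nn.GroupNorm(32, dim)) for c in in_channels)
        # 3x3 fuse convolutions (with GN + ReLU) applied after each top-down sum.
        self.fuse = nn.ModuleList(
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GroupNorm(32, dim), nn.ReLU())
            for _ in in_channels[:-1])
        # Final projection to per-pixel embeddings (kernel size is an assumption).
        self.out = nn.Conv2d(dim, embed_dim, 3, padding=1)

    def forward(self, feats):
        # feats: backbone feature maps ordered [stride 4, stride 8, stride 16, stride 32].
        x = self.lateral[-1](feats[-1])                       # start from the stride-32 map
        for lat, fuse, f in zip(self.lateral[-2::-1], self.fuse, feats[-2::-1]):
            x = lat(f) + F.interpolate(x, size=f.shape[-2:], mode="nearest")
            x = fuse(x)
        return self.out(x)                                    # per-pixel embeddings at stride 4
```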

Transformer decoder. We use the same Transformer decoder design as DETR detr. The query embeddings are initialized as zero vectors, and we associate each query with a learnable positional encoding. We use 6 Transformer decoder layers with 100 queries by default, and, following DETR, we apply the same loss after each decoder layer. In our experiments we observe that MaskFormer is competitive for semantic segmentation with a single decoder layer too, whereas for instance-level segmentation multiple layers are necessary to remove duplicates from the final predictions.
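
A sketch of such a transformer module with learnable query positional encodings is shown below (a plain nn.TransformerDecoder stands in for the DETR-style decoder; feature positional encodings and the per-layer injection of the query embeddings are omitted for brevity):

```python
import torch.nn as nn

class TransformerModule(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_layers=6, nhead=8):
        super().__init__()
        self.query_pos = nn.Embedding(num_queries, dim)  # learnable query positional encodings
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=nhead)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, image_features):                         # (B, C, H/32, W/32)
        B, C, H, W = image_features.shape
        memory = image_features.flatten(2).permute(2, 0, 1)    # (H*W, B, C)
        # Simplification: the learnable embeddings are used directly as decoder input;
        # DETR instead keeps zero-initialized query content and adds them at every layer.
        tgt = self.query_pos.weight.unsqueeze(1).expand(-1, B, -1)  # (N, B, C)
        return self.decoder(tgt, memory)                       # (N, B, C) per-segment embeddings
```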

Segmentation module. The multi-layer perceptron (MLP) in Figure 2 has 2 hidden layers of 256 channels to predict the mask embeddings , analogously to the box head in DETR. Both per-pixel and mask embeddings have 256 channels.
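
The two prediction heads can be sketched as follows (the sizes follow the text above; names are illustrative, not the authors' code):

```python
import torch.nn as nn

class SegmentationHeads(nn.Module):
    def __init__(self, dim=256, num_classes=150, mask_dim=256):
        super().__init__()
        self.class_head = nn.Linear(dim, num_classes + 1)      # K + 1 logits (incl. "no object")
        self.mask_embed_head = nn.Sequential(                   # MLP with 2 hidden layers
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, mask_dim))                           # -> mask embeddings

    def forward(self, per_segment_embeddings):                  # (..., N, dim)
        class_logits = self.class_head(per_segment_embeddings)
        mask_embeddings = self.mask_embed_head(per_segment_embeddings)
        return class_logits, mask_embeddings
```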

Loss weights. We use focal loss lin2017focal and dice loss milletari2016v for our mask loss: $\mathcal{L}_{\text{mask}}(m, m^{\mathrm{gt}}) = \lambda_{\text{focal}} \mathcal{L}_{\text{focal}}(m, m^{\mathrm{gt}}) + \lambda_{\text{dice}} \mathcal{L}_{\text{dice}}(m, m^{\mathrm{gt}})$, and set the hyper-parameters to $\lambda_{\text{focal}} = 20.0$ and $\lambda_{\text{dice}} = 1.0$. Following DETR detr, the weight for the "no object" class ($\varnothing$) in the classification loss is set to 0.1.
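
A sketch of this weighted focal + dice mask loss (the focal-loss alpha/gamma values and the mean reduction are assumptions for illustration):

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    pred, target = pred.flatten(1), target.flatten(1)
    inter = (pred * target).sum(-1)
    return (1 - (2 * inter + eps) / (pred.sum(-1) + target.sum(-1) + eps)).mean()

def focal_loss(pred: torch.Tensor, target: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    # Binary focal loss on soft mask predictions in [0, 1].
    pt = pred * target + (1 - pred) * (1 - target)
    w = alpha * target + (1 - alpha) * (1 - target)
    return (-w * (1 - pt).pow(gamma) * pt.clamp(min=1e-6).log()).mean()

def mask_loss(pred_masks, gt_masks, lambda_focal=20.0, lambda_dice=1.0):
    return (lambda_focal * focal_loss(pred_masks, gt_masks)
            + lambda_dice * dice_loss(pred_masks, gt_masks))
```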

4.2 Training settings

Semantic segmentation. We use Detectron2 wu2019detectron2 and follow the commonly used training settings for each dataset. More specifically, we use AdamW loshchilov2018decoupled and the poly deeplabV2 learning rate schedule with an initial learning rate of and a weight decay of for ResNet he2016deep backbones, and an initial learning rate of and a weight decay of for Swin-Transformer liu2021swin backbones. Backbones are pre-trained on ImageNet-1K Russakovsky2015 if not stated otherwise. A learning rate multiplier of is applied to CNN backbones and is applied to Transformer backbones. The standard random scale jittering between and , random horizontal flipping, random cropping as well as random color jittering are used as data augmentation mmseg2020. For the ADE20K dataset, if not stated otherwise, we use a crop size of , a batch size of and train all models for 160k iterations. For the ADE20K-Full dataset, we use the same setting as ADE20K except that we train all models for 200k iterations. For the COCO-Stuff-10k dataset, we use a crop size of , a batch size of 32 and train all models for 60k iterations. All models are trained with 8 V100 GPUs. We report both the performance of single-scale (s.s.) inference and multi-scale (m.s.) inference with horizontal flip. See the appendix for Cityscapes and Mapillary Vistas settings.

Panoptic segmentation. We follow exactly the same architecture, loss, and training procedure as we use for semantic segmentation. The only difference is supervision: i.e., category region masks in semantic segmentation vs. object instance masks in panoptic segmentation. We strictly follow the DETR detr setting to train our model on the COCO panoptic segmentation dataset kirillov2017panoptic for a fair comparison. On the ADE20K panoptic segmentation dataset, we follow the semantic segmentation setting but train for longer (720k iterations) and use a larger crop size. COCO models are trained using 64 V100 GPUs and ADE20K experiments are trained with 8 V100 GPUs. We use the general inference (Section 3.4) with the following parameters: we filter out masks with class confidence below 0.8 and set masks whose contribution to the final panoptic segmentation is less than 80% of their mask area to VOID. We report the performance of single-scale inference.

method backbone crop size mIoU (s.s.) mIoU (m.s.) #params. FLOPs fps
CNN backbones:
OCRNet yuan2020object R101c - 45.3 - - -
DeepLabV3+ deeplabV3plus R50c 44.0 44.9 44M 177G 21.0
DeepLabV3+ deeplabV3plus R101c 45.5 46.4 63M 255G 14.2
MaskFormer (ours) R50 44.5 ±0.5 46.7 ±0.6 41M 53G 24.5
MaskFormer (ours) R101 45.5 ±0.5 47.2 ±0.2 60M 73G 19.5
MaskFormer (ours) R101c 46.0 ±0.1 48.1 ±0.2 60M 80G 19.0
Transformer backbones:
SETR zheng2021rethinking ViT-L - 50.3 308M - -
Swin-UperNet liu2021swin; xiao2018unified Swin-T - 46.1 60M 236G 18.5
Swin-UperNet liu2021swin; xiao2018unified Swin-S - 49.3 81M 259G 15.2
Swin-UperNet liu2021swin; xiao2018unified Swin-B - 51.6 121M 471G 8.7
Swin-UperNet liu2021swin; xiao2018unified Swin-L - 53.5 234M 647G 6.2
MaskFormer (ours) Swin-T 46.7 ±0.7 48.8 ±0.6 42M 55G 22.1
MaskFormer (ours) Swin-S 49.8 ±0.4 51.0 ±0.4 63M 79G 19.6
MaskFormer (ours) Swin-B 52.7 ±0.4 53.9 ±0.2 102M 195G 12.6
MaskFormer (ours) Swin-L 54.1 ±0.2 55.6 ±0.1 212M 375G 7.9
Table 1: Semantic segmentation on ADE20K val with 150 categories. Mask classification-based MaskFormer outperforms the best per-pixel classification approaches while using fewer parameters and less computation. We report both single-scale (s.s.) and multi-scale (m.s.) inference results with std. FLOPs are computed for the given crop size. Frames-per-second (fps) is measured on a V100 GPU with a batch size of 1; it is not recommended to compare fps from different papers, as speed is measured in different environments (DeepLabV3+ fps are from MMSegmentation mmseg2020, and Swin-UperNet fps are from the original paper liu2021swin). Backbones pre-trained on ImageNet-22K are marked accordingly.
Cityscapes (19 classes) ADE20K (150 classes) COCO-Stuff (171 classes) ADE20K-Full (847 classes)
method mIoU PQ mIoU PQ mIoU PQ mIoU PQ
PerPixelBaseline 77.4 58.9 39.2 21.6 32.4 15.5 12.4 5.8
PerPixelBaseline+ 78.5 60.2 41.9 28.3 34.2 24.6 13.9 9.0
MaskFormer (ours) 78.5 (+0.0) 63.1 (+2.9) 44.5 (+2.6) 33.4 (+5.1) 37.1 (+2.9) 28.9 (+4.3) 17.4 (+3.5) 11.9 (+2.9)
Table 2: MaskFormer vs. per-pixel classification baselines on 4 semantic segmentation datasets. MaskFormer improvement is larger when the number of classes is larger. We use a ResNet-50 backbone and report single scale mIoU and PQ for ADE20K, COCO-Stuff and ADE20K-Full, whereas for higher-resolution Cityscapes we use a deeper ResNet-101 backbone following deeplabV3; deeplabV3plus.

4.3 Main results

Semantic segmentation. In Table 1, we compare MaskFormer with state-of-the-art per-pixel classification models for semantic segmentation on the ADE20K dataset. With the same standard CNN backbones (e.g., ResNet he2016deep), MaskFormer outperforms DeepLabV3+ deeplabV3plus by 1.7 mIoU. MaskFormer is also compatible with recent Vision Transformer dosovitskiy2020vit backbones (e.g., the Swin Transformer liu2021swin), achieving a new state-of-the-art of 55.6 mIoU, which is 2.1 mIoU better than the prior state-of-the-art liu2021swin. Observe that MaskFormer outperforms the best per-pixel classification-based models while having fewer parameters and faster inference time. This result suggests that the mask classification formulation has significant potential for semantic segmentation.

Beyond ADE20K, we further compare MaskFormer with our baselines on COCO-Stuff-10K, ADE20K-Full as well as Cityscapes in Table 2, and we refer to the appendix for comparison with state-of-the-art methods on these datasets. The improvement of MaskFormer over PerPixelBaseline+ is larger when the number of classes is larger: for Cityscapes, which has only 19 categories, MaskFormer performs similarly to PerPixelBaseline+; while for ADE20K-Full, which has 847 classes, MaskFormer outperforms PerPixelBaseline+ by 3.5 mIoU.

Although MaskFormer shows no improvement in mIoU for Cityscapes, the PQ metric increases by 2.9 PQ. We find MaskFormer performs better in terms of recognition quality (RQ) while lagging in per-pixel segmentation quality (SQ) (we refer to the appendix for detailed numbers). This observation suggests that on datasets where class recognition is relatively easy to solve, the main challenge for mask classification-based approaches is pixel-level accuracy (i.e., mask quality).

method backbone PQ PQTh PQSt SQ RQ #params. FLOPs fps
CNN backbones:
DETR detr R50 + 6 Enc 43.4 48.2 36.3 79.3 53.8 - - -
MaskFormer (DETR) R50 + 6 Enc 45.6 50.0 (+1.8) 39.0 (+2.7) 80.2 55.8 - - -
MaskFormer (ours) R50 + 6 Enc 46.5 51.0 (+2.8) 39.8 (+3.5) 80.4 56.8 45M 181G 17.6
DETR detr R101 + 6 Enc 45.1 50.5 37.0 79.9 55.5 - - -
MaskFormer (ours) R101 + 6 Enc 47.6 52.5 (+2.0) 40.3 (+3.3) 80.7 58.0 64M 248G 14.0
Transformer backbones:
Max-DeepLab wang2021max Max-S 48.4 53.0 41.5 - - 62M 324G 7.6
Max-DeepLab wang2021max Max-L 51.1 57.0 42.2 - - 451M 3692G -
MaskFormer (ours) Swin-T 47.7 51.7 41.7 80.4 58.3 42M 179G 17.0
MaskFormer (ours) Swin-S 49.7 54.4 42.6 80.9 60.4 63M 259G 12.4
MaskFormer (ours) Swin-B 51.1 56.3 43.2 81.4 61.8 102M 411G 8.4
MaskFormer (ours) Swin-B 51.8 56.9 44.1 81.4 62.6 102M 411G 8.4
MaskFormer (ours) Swin-L 52.7 58.5 44.0 81.8 63.5 212M 792G 5.2
Table 3: Panoptic segmentation on COCO panoptic val with 133 categories. MaskFormer seamlessly unifies semantic- and instance-level segmentation without modifying the model architecture or loss. Our model, which achieves better results, can be regarded as a box-free simplification of DETR detr. The major improvement comes from "stuff" classes (PQSt), which are ambiguous to represent with bounding boxes. For MaskFormer (DETR) we use the exact same post-processing as DETR. Note that in this setting MaskFormer performance is still better than DETR (+2.2 PQ). Our model also outperforms the recently proposed Max-DeepLab wang2021max without the need for sophisticated auxiliary losses, while being more efficient. FLOPs are computed as the average FLOPs over 100 validation images (COCO images have varying sizes). Frames-per-second (fps) is measured on a V100 GPU with a batch size of 1 by taking the average runtime on the entire val set, including post-processing time. Backbones pre-trained on ImageNet-22K are marked accordingly.

Panoptic segmentation. In Table 3, we compare the exact same MaskFormer model with DETR detr on the COCO panoptic dataset. To match the standard DETR design, we add 6 additional Transformer encoder layers after the CNN backbone. Unlike DETR, our model does not predict bounding boxes but instead predicts masks directly. MaskFormer achieves better results while being simpler than DETR. To disentangle the improvements from the model itself and our post-processing inference strategy, we run our model following DETR post-processing (MaskFormer (DETR)) and observe that this setup outperforms DETR by 2.2 PQ. Overall, we observe a larger improvement in PQSt compared to PQTh. This suggests that detecting "stuff" with bounding boxes is suboptimal, and therefore, box-based segmentation models (e.g., Mask R-CNN he2017mask) do not suit semantic segmentation. MaskFormer also outperforms the recently proposed Max-DeepLab wang2021max without the need for a special network design as well as sophisticated auxiliary losses (i.e., instance discrimination loss, mask-ID cross entropy loss, and per-pixel classification loss in wang2021max). MaskFormer, for the first time, unifies semantic- and instance-level segmentation with the exact same model, loss, and training pipeline.

We further evaluate our model on the panoptic segmentation version of the ADE20K dataset. Our model also achieves state-of-the-art performance. We refer to the appendix for detailed results.

4.4 Ablation studies

We perform a series of ablation studies of MaskFormer using a single ResNet-50 backbone he2016deep.

Per-pixel vs. mask classification. In Table 4, we verify that the gains demonstrated by MaskFormer come from shifting the paradigm to mask classification. We start by comparing PerPixelBaseline+ and MaskFormer. The models are very similar and there are only 3 differences: 1) per-pixel vs. mask classification used by the models, 2) MaskFormer uses bipartite matching, and 3) the new model uses a combination of focal and dice losses as a mask loss, whereas PerPixelBaseline+ utilizes a per-pixel cross entropy loss. First, we rule out the influence of loss differences by training PerPixelBaseline+ with exactly the same losses and observing no improvement. Next, in Table 4(a), we compare PerPixelBaseline+ with MaskFormer trained using a fixed matching (MaskFormer-fixed), i.e., $N = K$ and assignment is done based on category label indices, identically to the per-pixel classification setup. We observe that MaskFormer-fixed is 1.8 mIoU better than the baseline, suggesting that shifting from per-pixel classification to mask classification is indeed the main reason for the gains of MaskFormer. In Table 4(b), we further compare MaskFormer-fixed with MaskFormer trained with bipartite matching (MaskFormer-bipartite) and find that bipartite matching is not only more flexible (allowing the model to predict fewer masks than the total number of categories) but also produces better results.

mIoU PQ
PerPixelBaseline+ 41.9 28.3
MaskFormer-fixed 43.7 (+1.8) 30.3 (+2.0)
(a) Per-pixel vs. mask classification.
mIoU PQ
MaskFormer-fixed 43.7 30.3
MaskFormer-bipartite (ours) 44.2 (+0.5) 33.4 (+3.1)
(b) Fixed vs. bipartite matching assignment.
Table 4: Per-pixel vs. mask classification for semantic segmentation. All models use 150 queries for a fair comparison. We evaluate the models on ADE20K val with 150 categories. (a): PerPixelBaseline+ and MaskFormer-fixed use a similar fixed matching (i.e., matching by category index); this result confirms that the shift from per-pixel to mask classification is the key. (b): bipartite matching is not only more flexible (it can make fewer predictions than the total class count) but also gives better results.
ADE20K COCO-Stuff ADE20K-Full
# of queries mIoU PQ mIoU PQ mIoU PQ
PerPixelBaseline+ 41.9 28.3 34.2 24.6 13.9 9.0
20 42.9 32.6 35.0 27.6 14.1 10.8
50 43.9 32.7 35.5 27.9 15.4 11.1
100 44.5 33.4 37.1 28.9 16.0 11.9
150 44.2 33.4 37.0 28.9 15.5 11.5
300 43.5 32.3 36.1 29.1 14.2 10.3
1000 35.4 26.7 34.4 27.6 8.0 5.8

Number of queries. The table to the right shows the results of MaskFormer trained with a varying number of queries on datasets with different numbers of categories. The model with 100 queries consistently performs the best across multiple datasets. This suggests that we do not need to adjust the number of queries w.r.t. the number of categories or datasets. Interestingly, even with only 20 queries MaskFormer outperforms our per-pixel classification baseline.

We further calculate the number of classes which are on average present in a training set image. We find these statistics to be similar across datasets despite the fact that the datasets have different numbers of total categories: 8.2 classes per image for ADE20K (150 classes), 6.6 classes per image for COCO-Stuff-10K (171 classes) and 9.1 classes per image for ADE20K-Full (847 classes). We hypothesize that each query is able to capture masks from multiple categories.

Figure: number of unique classes predicted by each query on the validation set, shown for (a) ADE20K (150 classes), (b) COCO-Stuff-10K (171 classes), and (c) ADE20K-Full (847 classes).

The figure to the right shows the number of unique categories predicted by each query (sorted in descending order) of our MaskFormer model on the validation sets of the corresponding datasets. Interestingly, the number of unique categories per query does not follow a uniform distribution: some queries capture more classes than others. We try to analyze how MaskFormer queries group categories, but we do not observe any obvious pattern: there are queries capturing categories with similar semantics or shapes (e.g., "house" and "building"), but there are also queries capturing completely different categories (e.g., "water" and "sofa").

Number of Transformer decoder layers. Interestingly, MaskFormer with even a single Transformer decoder layer already performs well for semantic segmentation and achieves better performance than our 6-layer-decoder PerPixelBaseline+. For panoptic segmentation, however, multiple decoder layers are required to achieve competitive performance. Please see the appendix for a detailed discussion.

5 Conclusion

The paradigm discrepancy between semantic- and instance-level segmentation results in entirely different models for each task, hindering development of image segmentation as a whole. We show that a simple mask classification model can outperform state-of-the-art per-pixel classification models, especially in the presence of a large number of categories. Our model also remains competitive for panoptic segmentation, without a need to change the model architecture, losses, or training procedure. We hope this unification spurs a joint effort across semantic- and instance-level segmentation tasks.

Acknowledgments.

We thank Ross Girshick for insightful comments and suggestions. Work of UIUC authors BC and AS was supported in part by NSF under Grant #1718221, 2008387, 2045586, MRI #1725729, NIFA award 2020-67021-32799 and Cisco Systems Inc. (Gift Award CG 1377144 - thanks for access to Arcetri).

Appendix

We first provide more information regarding the datasets used in our experimental evaluation of MaskFormer (Appendix A). Then, we provide detailed results of our model on more semantic (Appendix B) and panoptic (Appendix C) segmentation datasets. Finally, we provide additional ablation studies (Appendix D) and visualization (Appendix E).

Appendix A Datasets description

We study MaskFormer using five semantic segmentation datasets and two panoptic segmentation datasets. Here, we provide more detailed information about these datasets.

a.1 Semantic segmentation datasets

ADE20K zhou2017ade20k contains 20k images for training and 2k images for validation. The data comes from the ADE20K-Full dataset where 150 semantic categories are selected to be included in evaluation from the SceneParse150 challenge ade20k_sceneparse_150. The images are resized such that the shortest side is no greater than 512 pixels. During inference, we resize the shorter side of the image to the corresponding crop size.

COCO-Stuff-10K caesar2016coco has 171 semantic-level categories. There are 9k images for training and 1k images for testing. Images in the COCO-Stuff-10K datasets are a subset of the COCO dataset lin2014coco. During inference, we resize the shorter side of the image to the corresponding crop size.

ADE20K-Full zhou2017ade20k contains 25k images for training and 2k images for validation. The ADE20K-Full dataset is annotated in an open-vocabulary setting with more than 3000 semantic categories. We filter these categories by selecting those that are present in both training and validation sets, resulting in a total of 847 categories. We follow the same process as ADE20K-SceneParse150 to resize images such that the shortest side is no greater than 512 pixels. During inference, we resize the shorter side of the image to the corresponding crop size.

Cityscapes Cordts2016Cityscapes is an urban egocentric street-view dataset with high-resolution images (1024×2048 pixels). It contains 2975 images for training, 500 images for validation, and 1525 images for testing with a total of 19 classes. During training, we use random crops, a batch size of 16, and train all models for 90k iterations. During inference, we operate on the whole image (1024×2048).

Mapillary Vistas neuhold2017mapillary is a large-scale urban street-view dataset with 65 categories. It contains 18k, 2k, and 5k images for training, validation and testing with a variety of image resolutions, ranging from to . During training, we resize the short side of images to 2048 before applying scale augmentation. We use a crop size of , a batch size of and train all models for 300k iterations. During inference, we resize the longer side of the image to 2048 and only use three scales (0.5, 1.0 and 1.5) for multi-scale testing due to GPU memory constraints.

a.2 Panoptic segmentation datasets

COCO panoptic kirillov2017panoptic is one of the most commonly used datasets for panoptic segmentation. It has 133 categories (80 “thing” categories with instance-level annotation and 53 “stuff” categories) in 118k images for training and 5k images for validation. All images are from the COCO dataset lin2014coco.

ADE20K panoptic zhou2017ade20k combines the ADE20K semantic segmentation annotation for semantic segmentation from the SceneParse150 challenge ade20k_sceneparse_150 and ADE20K instance annotation from the COCO+Places challenge coco_places_challenges_2017. Among the 150 categories, there are 100 “thing” categories with instance-level annotation. We find filtering masks with a lower threshold (we use 0.7 for ADE20K) than COCO (which uses 0.8) gives slightly better performance.

method backbone mIoU (s.s.) mIoU (m.s.)
OCRNet yuan2020object R101c - 39.5
PerPixelBaseline R50 32.4 ±0.2 34.4 ±0.4
PerPixelBaseline+ R50 34.2 ±0.2 35.8 ±0.4
MaskFormer (ours) R50 37.1 ±0.4 38.9 ±0.2
MaskFormer (ours) R101 38.1 ±0.3 39.8 ±0.6
MaskFormer (ours) R101c 38.0 ±0.3 39.3 ±0.4
(a) COCO-Stuff-10K.
mIoU (s.s.) training memory
- -
12.4 ±0.2 8030M
13.9 ±0.1 26698M
16.0 ±0.3 6529M
16.8 ±0.2 6894M
17.4 ±0.4 6904M
(b) ADE20K-Full.
Table I: Semantic segmentation on COCO-Stuff-10K test with 171 categories and ADE20K-Full val with 847 categories. Table I(a): MaskFormer is competitive on COCO-Stuff-10K, showing the generality of mask classification. Table I(b): MaskFormer results on the harder large-vocabulary semantic segmentation. MaskFormer performs better than per-pixel classification and requires less memory during training, thanks to decoupling the number of masks from the number of classes. mIoU (s.s.) and mIoU (m.s.) are the mIoU of single-scale and multi-scale inference with std.
method backbone mIoU (m.s.)
Panoptic-DeepLab cheng2020panoptic X71 chollet2017xception 81.5
OCRNet yuan2020object R101c 82.0
MaskFormer (ours) R101 80.3 ±0.1
MaskFormer (ours) R101c 81.4 ±0.2
(a) Cityscapes standard mIoU metric.
PQ (m.s.) SQ (m.s.) RQ (m.s.)
66.6 82.9 79.4
66.1 82.6 79.1
65.9 81.5 79.7
66.9 82.0 80.5
(b) Cityscapes analysis with the PQ metric suite.
Table II: Semantic segmentation on Cityscapes val with 19 categories. (a): MaskFormer is on par with state-of-the-art methods on Cityscapes, which has fewer categories than the other considered datasets. We report multi-scale (m.s.) inference results with std for a fair comparison across methods. (b): We analyze MaskFormer with a complementary PQ metric, by treating all categories as "stuff." The breakdown of PQ suggests mask classification-based MaskFormer is better at recognizing regions (RQ) while slightly lagging in the generation of high-quality masks (SQ).
method backbone mIoU (s.s.) mIoU (m.s.)
DeepLabV3+ deeplabV3plus R50 47.7 49.4
HMSANet tao2020hierarchical R50 - 52.2
MaskFormer (ours) R50 53.1 55.4
Table III: Semantic segmentation on Mapillary Vistas val with 65 categories. MaskFormer outperforms per-pixel classification methods on high-resolution images without the need of multi-scale inference, thanks to global context captured by the Transformer decoder. mIoU (s.s.) and mIoU (m.s.) are the mIoU of single-scale and multi-scale inference.
method backbone PQ PQTh PQSt SQ RQ
BGRNet wu2020bidirectional R50 31.8 - - - -
Auto-Panoptic wu2020auto ShuffleNetV2 ma2018shufflenet 32.4 - - - -
MaskFormer (ours) R50 + 6 Enc 34.7 32.2 39.7 76.7 42.8
MaskFormer (ours) R101 + 6 Enc 35.7 34.5 38.0 77.4 43.8
Table IV: Panoptic segmentation on ADE20K panoptic val with 150 categories. Following DETR detr, we add 6 additional Transformer encoders when using ResNet he2016deep (R50 + 6 Enc and R101 + 6 Enc) backbones. MaskFormer achieves competitive results on ADE20K panoptic, showing the generality of our model for panoptic segmentation.

Appendix B Semantic segmentation results

COCO-Stuff-10K. Table I(a) compares MaskFormer with our baselines as well as the state-of-the-art OCRNet model yuan2020object on the COCO-Stuff-10K caesar2016coco dataset. MaskFormer outperforms our per-pixel classification baselines by a large margin and achieves competitive performance compared to OCRNet. These results demonstrate the generality of the MaskFormer model.

ADE20K-Full. We further demonstrate the benefits of MaskFormer for large-vocabulary semantic segmentation in Table I(b). Since we are the first to report performance on this dataset, we only compare MaskFormer with our per-pixel classification baselines. MaskFormer not only achieves better performance, but is also more memory efficient on the ADE20K-Full dataset with 847 categories, thanks to decoupling the number of masks from the number of classes. These results show that MaskFormer has the potential to deal with real-world segmentation problems with thousands of categories.

Cityscapes. In Table II(a), we report MaskFormer performance on Cityscapes, the standard testbed for modern semantic segmentation methods. The dataset has only 19 categories and therefore the recognition aspect of the dataset is less challenging than in the other considered datasets. We observe that MaskFormer performs on par with the best per-pixel classification methods. To better analyze MaskFormer, in Table II(b), we further report PQ. We find MaskFormer performs better in terms of recognition quality (RQ) while lagging in per-pixel segmentation quality (SQ). This suggests that on datasets where recognition is relatively easy to solve, the main challenge for mask classification-based approaches is pixel-level accuracy.

Mapillary Vistas. Table III compares MaskFormer with state-of-the-art per-pixel classification models on the high-resolution Mapillary Vistas dataset. We observe: (1) MaskFormer is able to handle high-resolution images, and (2) MaskFormer outperforms multi-scale per-pixel classification models even without the need for multi-scale inference. We believe the Transformer decoder in MaskFormer is able to capture global context even for high-resolution images.

Appendix C Panoptic segmentation results

We demonstrate the generality of our model for panoptic segmentation on the ADE20K dataset in Table IV, where MaskFormer is competitive with the state-of-the-art methods.

Appendix D Additional ablation studies

We perform additional ablation studies of MaskFormer for semantic segmentation using the same setting as that in the main paper: a single ResNet-50 backbone he2016deep, and we report both the mIoU and the PQ. The default setting of our MaskFormer is: 100 queries and 6 Transformer decoder layers.

ADE20K (150 classes) COCO-Stuff (171 classes) ADE20K-Full (847 classes)
inference mIoU PQ SQ RQ mIoU PQ SQ RQ mIoU PQ SQ RQ
PerPixelBaseline+ 41.9 28.3 71.9 36.2 34.2 24.6 62.6 31.2 13.9 9.0 24.5 12.0
general 42.4 34.2 74.4 43.5 35.5 29.7 66.3 37.0 15.1 11.6 28.3 15.3
semantic 44.5 33.4 75.4 42.4 37.1 28.9 66.3 35.9 16.0 11.9 28.6 15.7
Table V: Inference strategies for semantic segmentation. general: general inference (Section 3.4) which first filters low-confidence masks (using a threshold of 0.3) and assigns labels to the remaining ones. semantic: the default semantic inference (Section 3.4) for semantic segmentation.

Inference strategies. In Table V, we ablate inference strategies for mask classification-based models performing semantic segmentation (discussed in Section 3.4). We compare our default semantic inference strategy and the general inference strategy which first filters out low-confidence masks (a threshold of 0.3 is used) and assigns the class labels to the remaining masks. We observe 1) general inference is only slightly better than the PerPixelBaseline+ in terms of the mIoU metric, and 2) on multiple datasets the general inference strategy performs worse in terms of the mIoU metric than the default semantic inference. However, the general inference has higher PQ, due to better recognition quality (RQ). We hypothesize that the filtering step removes false positives which increases the RQ. In contrast, the semantic inference aggregates mask predictions from multiple queries thus it has better mask quality (SQ). This observation suggests that semantic and instance-level segmentation can be unified with a single inference strategy (i.e., our general inference) and the choice of inference strategy largely depends on the evaluation metric instead of the task.

ADE20K-Semantic ADE20K-Panoptic
# of decoder layers mIoU PQ SQ RQ PQ PQTh PQSt SQ RQ
6 (PerPixelBaseline+) 41.9 28.3 71.9 36.2 - - - - -
1 43.0 31.1 74.3 39.7 31.9 29.6 36.6 76.6 39.6
6 44.5 33.4 75.4 42.4 34.7 32.2 39.7 76.7 42.8
6 (no self-attention) 44.6 32.8 74.5 41.5 32.6 29.9 38.2 75.6 40.4
Table VI: Ablation on number of Transformer decoder layers in MaskFormer. We find that MaskFormer with only one Transformer decoder layer is already able to achieve reasonable semantic segmentation performance. Stacking more decoder layers mainly improves the recognition quality.

Number of Transformer decoder layers. In Table VI, we ablate the effect of the number of Transformer decoder layers on ADE20K zhou2017ade20k for both semantic and panoptic segmentation. Surprisingly, we find a MaskFormer with even a single Transformer decoder layer already performs reasonably well for semantic segmentation and achieves better performance than our 6-layer-decoder per-pixel classification baseline PerPixelBaseline+. Whereas, for panoptic segmentation, the number of decoder layers is more important. We hypothesize that stacking more decoder layers is helpful to de-duplicate predictions which is required by the panoptic segmentation task.

To verify this hypothesis, we train MaskFormer models without self-attention in all 6 Transformer decoder layers. On semantic segmentation, we observe that MaskFormer without self-attention performs similarly well in terms of the mIoU metric; however, the per-mask metric PQ is slightly worse. On panoptic segmentation, MaskFormer models without self-attention perform worse across all metrics.

Figure I: Visualization of “semantic” queries and “panoptic” queries. Unlike the behavior in a MaskFormer model trained for panoptic segmentation (right), a single query is used to capture multiple instances in a MaskFormer model trained for semantic segmentation (left). Our model has the capacity to adapt to different types of tasks given different ground truth annotations.

“Semantic” queries vs. “panoptic” queries. In Figure I we visualize predictions for the “car” category from MaskFormer trained with semantic-level and instance-level ground truth data. In the case of semantic-level data, the matching cost and loss used for mask prediction force a single query to predict one mask that combines all cars together. In contrast, with instance-level ground truth, MaskFormer uses different queries to make mask predictions for each car. This observation suggests that our model has the capacity to adapt to different types of tasks given different ground truth annotations.


Figure II: Visualization of MaskFormer semantic segmentation predictions on the ADE20K dataset. We visualize the MaskFormer with Swin-L backbone which achieves 55.6 mIoU (multi-scale) on the validation set. First and third columns: ground truth. Second and fourth columns: prediction.

Appendix E Visualization

We visualize sample semantic segmentation predictions of the MaskFormer model with Swin-L liu2021swin backbone (55.6 mIoU) on the ADE20K validation set in Figure II.

References