Multi-domain semantic segmentation with pyramidal fusion

by Marin Oršić, et al.

We present our submission to the semantic segmentation contest of the Robust Vision Challenge held at ECCV 2020. The contest requires submitting the same model to seven benchmarks from three different domains. Our approach is based on the SwiftNet architecture with pyramidal fusion. We address inconsistent taxonomies with a single-level 193-dimensional softmax output. We strive to train with large batches in order to stabilize optimization of a hard recognition problem, and to favour smooth evolution of batchnorm statistics. We achieve this by implementing a custom backward step through log-sum-prob loss, and by using small crops before freezing the population statistics. Our model ranks first on the RVC semantic segmentation challenge as well as on the WildDash 2 leaderboard. This suggests that pyramidal fusion is competitive not only for efficient inference with lightweight backbones, but also in large-scale setups for multi-domain application.




1 Introduction

Large realistic datasets [cordts16cvpr, neuhold17iccv] have contributed immensely to the development of techniques for semantic segmentation of road-driving scenes. Reported accuracies have grown rapidly over the last five years. However, researchers soon realized that the learned models often perform poorly in the wild. This motivated the design of models which simultaneously address multiple datasets from several domains [kreso18arxiv, lambert20cvpr], akin to combined events in athletics. The second Robust Vision Challenge proposes a contest over 7 semantic segmentation datasets across three domains: interiors, photographs, and road-driving. Each participant has to submit the same model to all 7 benchmarks. This report presents the main insights which we gathered while participating in the challenge.

2 Datasets

Besides knowledge transfer from ImageNet, we train exclusively on the seven datasets of the challenge, which we summarize in Table 1. ADE20K contains rather small indoor and outdoor images, and has the most classes. ScanNet contains interior images with very noisy labels. Cityscapes, KITTI, Vistas, VIPER, and WildDash 2 contain road-driving images. Cityscapes contains images from western Europe taken with the same camera in fine weather. Vistas contains crowdsourced images from across the globe in all kinds of weather. WildDash 2 collects hand-picked images according to a system of hazards [zendel18eccv]. VIPER contains images generated by a computer game. However, its labeling is inconsistent with the remaining road-driving datasets. For example, a person seen through a windscreen is labeled as class person in VIPER, while other road-driving datasets label such pixels with a suitable vehicle class.

Dataset      Content      Size    Class count (train - test)   Resolution (mean ± std)
ADE20K       photos       22210   150 - 150                    460 ± 154
Cityscapes   driving      3475    28 - 19                      1448 ± 0
KITTI        driving      200     28 - 19                      682 ± 1
VIPER        artificial   18326   32 - 19                      1440 ± 0
ScanNet      interior     24902   40 - 20                      1109 ± 78
Vistas       driving      20000   65 - 65                      2908 ± 608
WildDash 2   driving      4256    26 - 20                      1440 ± 0
Table 1: Dataset summary. Size denotes the total number of annotated non-test images. Class count denotes the total number of training and test classes. Resolution denotes the mean and standard deviation of the square root of the number of pixels across the training split.
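
The resolution statistic from Table 1 can be illustrated with a minimal sketch. The function name `resolution_stats` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
from statistics import mean, pstdev

def resolution_stats(image_sizes):
    """Mean and standard deviation of sqrt(height * width) over a list of sizes.

    Assumption: population standard deviation; the source does not specify.
    """
    roots = [math.sqrt(h * w) for h, w in image_sizes]
    return mean(roots), pstdev(roots)

# Cityscapes images all share one size (1024 x 2048), so the deviation vanishes.
mu, sigma = resolution_stats([(1024, 2048)] * 4)
print(round(mu), round(sigma))  # prints: 1448 0
```

The printed values agree with the Cityscapes row of the table.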

3 Universal taxonomy

We consider semantic classes as sets of pixels in all possible images and borrow set notation to express relations between them. We build a flat universal taxonomy defined by mappings from dataset-specific classes to subsets of a universal set of disjoint elementary classes [lambert20cvpr]. The taxonomy can be produced by iterative application of the following rules.

  1. If a class exactly matches another class, they are merged.
    Example: results in dataset-specific mappings and .

  2. If a class is a subset of another class, $c_i \subset c_j$, the superset class is replaced with the difference: $c_j' = c_j \setminus c_i$. Dataset mappings are updated by replacing $c_j$ with $\{c_i, c_j'\}$.
    Example (abstraction): results in and .
    Example (composition): results in and .

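  3. If two classes are overlapping, $c_i \cap c_j \neq \emptyset$, they are split into $c_i \setminus c_j$, $c_j \setminus c_i$, and $c_i \cap c_j$, so that there is no overlap. Dataset-specific mappings are updated by replacing $c_i$ with $\{c_i \setminus c_j, c_i \cap c_j\}$ and $c_j$ with $\{c_j \setminus c_i, c_i \cap c_j\}$.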
    Example: and results in and (note that truck, pickup and trailer are disjoint).

However, there are cases where applying rule 3 would make the taxonomy too complex. For instance, Vistas labels vehicle windows as vehicles, but VIPER labels these pixels with what is seen behind. We ignore the overlap by making simplifying assumptions. This particular issue cannot be properly resolved without relabeling.

In practice, we proceed as follows. First, we fuse classes from Vistas (the road-driving dataset with the finest granularity) and ADE20K (the dataset with the most classes). Subsequently, we add classes from other datasets which require finer granularity than in the existing universal set. This results in one-to-many mappings from each dataset to the universal set of 193 elementary classes. We provide the source code for mapping dataset classes to the universal taxonomy.
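
The merge, subset and overlap rules can be illustrated on a toy model. This is a sketch under stated assumptions, not the released mapping code: each dataset class is described by a set of hypothetical fine-grained concepts ("atoms") that stand in for pixel sets, and the datasets `A` and `B` with their class contents are invented for illustration. Grouping atoms by the set of dataset classes that contain them yields the disjoint partition that repeated application of the three rules would converge to in this toy setting.

```python
from collections import defaultdict

def build_universal_taxonomy(dataset_classes):
    """Derive disjoint universal classes and one-to-many dataset mappings.

    dataset_classes maps (dataset, class_name) to a set of atoms.
    """
    # Record, for every atom, which dataset classes contain it.
    membership = defaultdict(set)
    for key, atoms in dataset_classes.items():
        for atom in atoms:
            membership[atom].add(key)
    # Atoms with identical membership form one elementary universal class.
    cells = defaultdict(set)
    for atom, keys in membership.items():
        cells[frozenset(keys)].add(atom)
    universal = sorted((frozenset(c) for c in cells.values()), key=sorted)
    # Each dataset class maps to the universal classes it fully contains.
    mapping = {
        key: [u for u in universal if u <= atoms]
        for key, atoms in dataset_classes.items()
    }
    return universal, mapping

# Hypothetical toy input: two overlapping "truck" classes and two equal "road" classes.
toy = {
    ("A", "truck"): {"truck", "pickup"},
    ("B", "truck"): {"truck", "trailer"},
    ("A", "road"): {"road"},
    ("B", "road"): {"road"},
}
universal, mapping = build_universal_taxonomy(toy)
for key, targets in mapping.items():
    print(key, [sorted(t) for t in targets])
```

In this toy case the two road classes are merged (rule 1), while the overlapping truck classes are split into three disjoint universal classes (rule 3), so each dataset class maps to one or more universal classes.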

4 Method

Our convolutional model receives a colour image and produces dense predictions into 193 universal classes. We use a SwiftNet architecture with pyramidal fusion [orsic20pr]. We apply a shared ResNet-152 backbone at three levels of a Gaussian resolution pyramid and use 256 feature maps along the upsampling path. Our predictions are 8 times subsampled with respect to the input resolution in order to decrease the memory footprint. We produce the logits at full resolution by 8× bilinear upsampling, and recover dataset-specific predictions by summing softmax probabilities of all universal classes which map to a particular dataset class.

We recover the probability of the void class by summing probabilities of all universal classes which do not map to the particular dataset. We obtain crisp predictions by applying argmax over all test classes, except on WildDash 2 where we apply the argmax over all test classes plus the void class.
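
The recovery of dataset-specific predictions for a single pixel can be sketched as follows. The function names, the five-class universal set and the class-to-universal mapping are hypothetical and chosen only for illustration; the actual model uses 193 universal classes.

```python
def dataset_probabilities(universal_probs, class_to_universal):
    """Sum universal-class probabilities into dataset classes plus a void class."""
    probs = {
        name: sum(universal_probs[u] for u in members)
        for name, members in class_to_universal.items()
    }
    # Universal classes that map to no class of this dataset contribute to void.
    mapped = set().union(*class_to_universal.values())
    probs["void"] = sum(p for u, p in enumerate(universal_probs) if u not in mapped)
    return probs

def predict(universal_probs, class_to_universal, allow_void=False):
    """Argmax over the dataset classes, optionally including the void class."""
    probs = dataset_probabilities(universal_probs, class_to_universal)
    if not allow_void:
        del probs["void"]
    return max(probs, key=probs.get)

# Hypothetical softmax output over five universal classes for one pixel.
pixel = [0.10, 0.25, 0.05, 0.20, 0.40]
toy_mapping = {"road": [0], "truck": [1, 2]}  # universal classes 3 and 4 are unmapped

print(predict(pixel, toy_mapping))                   # argmax over test classes only
print(predict(pixel, toy_mapping, allow_void=True))  # argmax including void
```

Here the first call returns `truck`, while the second returns `void`, because the unmapped universal classes carry most of the probability mass.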

We train the model with a compound loss which modulates the negative log-likelihood in order to prioritize poorly classified pixels and pixels at boundaries [orsic20pr]. We avoid caching logits at full resolution by implementing the log-sum-prob loss as a layer with a custom backprop step. This decreases the memory footprint during training by 2.54 GB per GPU.
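
The core of a log-sum-prob loss can be written in closed form for one pixel. The sketch below assumes the loss is the negative logarithm of the summed softmax probabilities of all universal classes that map to the ground-truth dataset class. It omits the boundary and difficulty modulation, and it does not reproduce the memory-saving layer described above. The function name and the toy inputs are hypothetical.

```python
import math

def log_sum_prob_loss(logits, allowed):
    """Return the loss -log(sum of softmax over `allowed`) and its gradient.

    logits:  list of per-class scores for one pixel
    allowed: set of universal class indices that map to the ground-truth class
    """
    m = max(logits)                       # subtract the maximum for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    inside = sum(exps[u] for u in allowed)
    loss = math.log(total) - math.log(inside)
    # d loss / d z_k = softmax_k - [k in allowed] * exp(z_k) / sum_allowed exp(z_u)
    grad = [
        e / total - (e / inside if u in allowed else 0.0)
        for u, e in enumerate(exps)
    ]
    return loss, grad

loss, grad = log_sum_prob_loss([1.0, 2.0, 0.5, -1.0], {1, 2})
print(loss, grad)
```

When `allowed` contains a single class, the expression reduces to the usual cross-entropy, with the gradient equal to the softmax output minus a one-hot vector.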

5 Training details

We train our submission on 6 Tesla V100 GPUs with 32GB RAM. We use random horizontal flipping, scale jittering and square cropping according to the schedule from Table 2. We attempt to alleviate noisy ScanNet labels by setting the boundary modulation to 1 (minimum) for all ScanNet crops.

We optimize our model with Adam. The learning rate is attenuated by cosine annealing. We freeze all batch normalization layers at epoch 50 and train for 3 more epochs. Once we freeze the batchnorm, we reset the gradient moments used by Adam. The training involves 140k iterations, which took around 4 days on our hardware.
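
A cosine annealing schedule over the 140k training iterations can be sketched as below. The values of `lr_max` and `lr_min` are placeholders chosen for illustration and are not the settings used for the submission.

```python
import math

def cosine_annealed_lr(step, total_steps, lr_max, lr_min):
    """Learning rate that decays from lr_max to lr_min along half a cosine period."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 140_000          # number of iterations stated in the text
LR_MAX, LR_MIN = 1e-3, 1e-6    # hypothetical bounds, for illustration only

for step in (0, 70_000, 140_000):
    print(step, cosine_annealed_lr(step, TOTAL_STEPS, LR_MAX, LR_MIN))
```

The schedule starts at `lr_max`, passes through the midpoint of the two bounds halfway through training, and ends at `lr_min`.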

We sample crops from 93,369 training images from all datasets. We favour fair representation of classes and datasets by composing mini-batches with roulette wheel sampling. In particular, we encourage sampling of images with multiple class instances and images with rare classes.
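
Roulette wheel sampling draws each image with probability proportional to a weight. The sketch below shows only the sampling step; the per-image weights are hypothetical, since the text does not give the exact weighting of class instances and rare classes.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def roulette_wheel_sample(weights, batch_size, rng):
    """Draw image indices with probability proportional to their weights."""
    cumulative = list(accumulate(weights))   # right edges of the wheel segments
    total = cumulative[-1]
    # Spin the wheel: a uniform point in [0, total) selects the segment it falls in.
    return [bisect_right(cumulative, rng.random() * total) for _ in range(batch_size)]

# Hypothetical weights: image 2 contains rare classes, image 0 is never sampled.
weights = [0.0, 1.0, 3.0]
rng = random.Random(0)
batch = roulette_wheel_sample(weights, 8, rng)
print(batch)
```

In practice the weight of an image could grow with the number of class instances it contains and with the rarity of its classes, which is the behaviour the sampling is meant to encourage.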

The biggest challenges were the sheer extent of the training data and the details of the multi-GPU implementation. We do not use batchnorm synchronization for simplicity and speed of training. Accordingly, the population statistics are updated on only one GPU. In contrast, we perform model updates by accumulating gradients from all GPUs. We did not find enough time to implement gradient checkpointing in the multithreaded environment. Hence, we based our solution on a backbone which provides reasonable performance without checkpointing. We did not use photometric jittering since we were not sure that our model has enough capacity for such a recognition problem. We had to forgo training on unlabeled images with an unsupervised loss in order to meet the deadline.

Epochs     Crop size   Batch size   Jitter range   Speed
0 – 15     384         6×16         0.75 – 1.33    45 fps
16 – 31    512         6×8          0.60 – 1.67    27 fps
32 – 49    768         6×4          0.50 – 2.00    14 fps
50 – 52    1024        6×2          0.40 – 2.50    9 fps
Table 2: Mini-batch configuration schedule across the training epochs. The columns show square crop size, batch size, and the range of uniform scale jittering. The final column shows how many crops are processed in each second of training.
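
The schedule from Table 2 can be expressed as a small lookup. This sketch assumes that the batch size column denotes 6 GPUs times a number of crops per GPU; that reading, like the names `SCHEDULE` and `batch_config`, is an assumption rather than something the table states explicitly.

```python
NUM_GPUS = 6  # number of GPUs stated in the training setup

# (first epoch, last epoch, crop size, crops per GPU, jitter range), read from Table 2.
# Interpreting the batch size as "GPUs x crops per GPU" is an assumption.
SCHEDULE = [
    (0, 15, 384, 16, (0.75, 1.33)),
    (16, 31, 512, 8, (0.60, 1.67)),
    (32, 49, 768, 4, (0.50, 2.00)),
    (50, 52, 1024, 2, (0.40, 2.50)),
]

def batch_config(epoch):
    """Return crop size, total batch size, jitter range and pixels per mini-batch."""
    for first, last, crop, per_gpu, jitter in SCHEDULE:
        if first <= epoch <= last:
            total = NUM_GPUS * per_gpu
            return {"crop": crop, "batch": total, "jitter": jitter,
                    "pixels": total * crop * crop}
    raise ValueError(f"epoch {epoch} is outside the schedule")

for epoch in (0, 16, 32, 50):
    print(epoch, batch_config(epoch))
```

Under this reading the number of pixels per mini-batch stays between roughly 12.6 and 14.2 million across all four stages, so larger crops are traded against smaller batches at an almost constant memory budget.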

6 Results

We evaluate all test images on only three scales due to limited time. Pixels where argmax corresponds to the void class occur on books (ScanNet), railroad (KITTI), cobblestone (Cityscapes), etc. We find the most such pixels in ScanNet (9.97%), Cityscapes (7.84%), and WildDash (7.01%), and the fewest in Vistas (0.04%). Table 3 summarizes the mIoU performance of the only two complete RVC 2020 submissions. The poor Cityscapes performance of our method is likely caused by mapping all WildDash trains to only one of the two universal classes corresponding to on-rail vehicles. We corrected that in the 32nd epoch of training. Otherwise, our submission prevails on all datasets except ADE20K, which has the smallest images. This suggests an advantage of pyramidal fusion at large input resolutions.

Dataset      MSeg1080_RVC   SN_RN152pyrx8_RVC (ours)
ADE20K       33.2           31.1
Cityscapes   80.7           74.7
KITTI        62.6           63.9
Vistas       34.2           40.4
ScanNet      48.5           54.6
VIPER        40.7           62.5
WildDash 2   35.2           45.4
Table 3: Performance (mIoU) of the two RVC’20 submissions.
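
For reference, mIoU averages the per-class intersection over union. The sketch below computes it from a confusion matrix; it follows the common definition and may differ in details from the evaluation protocols of the individual benchmarks, for instance in how void pixels or absent classes are treated. The confusion matrix is a made-up toy example.

```python
def mean_iou(confusion):
    """Mean intersection over union from a square confusion matrix.

    confusion[g][p] counts pixels with ground-truth class g predicted as class p.
    Classes that occur neither in the ground truth nor in the predictions are skipped.
    """
    ious = []
    for c in range(len(confusion)):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(row[c] for row in confusion) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)

# Toy two-class example: IoU is 3/5 for class 0 and 5/7 for class 1.
print(mean_iou([[3, 1], [1, 5]]))
```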

7 Conclusion

We have described our submission to the semantic segmentation contest of the Robust Vision Challenge 2020. Our model outputs dense predictions into 193 universal classes, which requires an enormous quantity of GPU memory during training. This suggests that memory consumption represents a dominant obstacle towards accurate multi-domain dense prediction. The reported results indicate that pyramidal fusion is capable of producing competitive performance in large-scale setups.