Large realistic datasets [cordts16cvpr, neuhold17iccv] have immensely contributed to the development of techniques for semantic segmentation of road-driving scenes. The reported accuracies grew rapidly over the last five years. However, researchers soon realized that the learned models often performed poorly in the wild. Hence there emerged a desire to design models which simultaneously address multiple datasets from several domains [kreso18arxiv, lambert20cvpr], akin to combined events in athletics. The second Robust Vision Challenge proposes a contest over 7 semantic segmentation datasets across three domains: interiors, photographs, and road-driving. Each participant has to submit the same model to all 7 datasets. This report presents the main insights which we gathered while participating in the challenge.
Besides knowledge transfer from ImageNet, we train exclusively on the seven datasets of the challenge, which we summarize in Table 1. ADE20k contains rather small indoor and outdoor images, and has the most classes. ScanNet contains interior images with very noisy labels. Cityscapes, KITTI, Vistas, VIPER, and WildDash 2 contain road-driving images. Cityscapes contains images from western Europe taken with the same camera in fine weather. Vistas contains crowdsourced images across the globe in all kinds of weather. WildDash 2 collects hand-picked images according to a system of hazards [zendel18eccv]. VIPER contains images generated by a computer game. However, its labeling is inconsistent with the remaining road-driving datasets. For example, a person seen through a windscreen is labeled as class person in VIPER, while other road-driving datasets label such pixels with a suitable vehicle class.
|WildDash 2||driving||4256||26 -||20||1440||0|
Dataset summary. Size denotes the total number of annotated non-test images. Class count denotes the total number of training and test classes. Resolution denotes the mean and standard deviation of the square root of the number of pixels across the training split.
3 Universal taxonomy
We consider semantic classes as sets of pixels in all possible images and borrow set notation to express relations between them. We build a flat universal taxonomy defined by mappings from dataset-specific classes to subsets of a universal set of disjoint elementary classes [lambert20cvpr]. The taxonomy can be produced by iterative application of the following rules.
1. If a class exactly matches another class, they are merged.
Example: results in dataset-specific mappings and .
2. If a class $c_i$ is a subset of another class $c_j$, $c_i \subset c_j$, the superset class is replaced with the difference: $c_j' = c_j \setminus c_i$. Dataset mappings are updated by replacing $c_j$ with $\{c_j', c_i\}$.
Example (abstraction): results in and .
Example (composition): results in and .
3. If two classes overlap, $c_i \cap c_j \neq \emptyset$, they are split into $c_i' = c_i \setminus c_j$, $c_j' = c_j \setminus c_i$, and $c' = c_i \cap c_j$, so that there is no overlap. Dataset-specific mappings are updated by replacing $c_i$ with $\{c_i', c'\}$ and $c_j$ with $\{c_j', c'\}$.
Example: and results in and (note that truck, pickup and trailer are disjoint).
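The rules can be illustrated with a minimal Python sketch. It assumes that each dataset class is described by a small set of hypothetical atomic concepts, which stands in for the set of pixels of that class. The function name `build_taxonomy` and the class names `A-truck` and `B-truck` are illustrative assumptions and are not taken from the released mapping code.

```python
from itertools import combinations


def build_taxonomy(dataset_classes):
    """Map each dataset class to a set of disjoint universal classes.

    dataset_classes: dict from a class name to an iterable of hypothetical
    atomic concepts, a toy stand-in for the set of pixels of that class.
    """
    # Initially, every dataset class maps to one candidate universal class.
    # Identical candidates compare equal as frozensets, so exact matches
    # (the merge rule) are handled implicitly by set membership.
    mapping = {name: {frozenset(atoms)} for name, atoms in dataset_classes.items()}
    changed = True
    while changed:
        changed = False
        universal = set().union(*mapping.values())
        for a, b in combinations(universal, 2):
            common = a & b
            if not common:
                continue  # disjoint classes need no action
            # Subset and overlap rules: split a and b into the non-empty
            # parts among (a minus b), (b minus a) and the intersection.
            parts_a = {p for p in (a - b, common) if p}
            parts_b = {p for p in (b - a, common) if p}
            for name, parts in mapping.items():
                updated = set()
                for p in parts:
                    updated |= parts_a if p == a else parts_b if p == b else {p}
                mapping[name] = updated
            changed = True
            break  # recompute the universal set after every split
    return mapping


# Toy usage with two hypothetical overlapping truck classes.
toy = {"A-truck": {"truck", "pickup"}, "B-truck": {"truck", "trailer"}}
for name, parts in build_taxonomy(toy).items():
    print(name, sorted(sorted(p) for p in parts))
```

In this toy setting both truck classes end up sharing one elementary class and keeping one class of their own, which mirrors the one-to-many mappings described in the text.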
However, there are cases where applying rule 3 would make the taxonomy too complex. For instance, Vistas labels vehicle windows as vehicles, but VIPER labels these pixels with what is seen behind. We ignore the overlap by making simplifying assumptions such as and . This particular issue cannot be properly resolved without relabeling.
In practice, we proceed as follows. First, we fuse classes from Vistas (the road-driving dataset with the finest granularity) and ADE20k (the dataset with the most classes). Subsequently, we add classes from other datasets which require finer granularity than in the existing universal set. This results in one-to-many mappings from each dataset to the universal set of 193 elementary classes. We provide the source code for mapping dataset classes to the universal taxonomy at https://drive.google.com/drive/folders/1Wi4Uku2ERaciLAVlUKCXVCozZWooZ27L .
Our convolutional model receives a colour image and produces dense predictions into 193 universal classes. We use a SwiftNet architecture with pyramidal fusion [orsic20pr]. We apply a shared ResNet-152 backbone at three levels of a Gaussian resolution pyramid and use 256 feature maps along the upsampling path. Our predictions are 8 times subsampled with respect to the input resolution in order to decrease the memory footprint. We produce the logits at full resolution by 8× bilinear upsampling, and recover predictions by summing softmax probabilities of all universal classes which map to a particular dataset class.
We recover the probability of the void class by summing probabilities of all universal classes which do not map to the particular dataset. We obtain crisp predictions by applying argmax over all test classes, except on WildDash 2 where we apply the argmax over all test classes plus the void class.
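The recovery of dataset-specific predictions can be sketched for a single pixel as follows. This is a minimal illustration that assumes the logits have already been upsampled. The function names and the list-of-indices layout of `class_to_universal` are hypothetical and do not reflect the actual implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dataset_probs(universal_logits, class_to_universal):
    """Return (per-class probabilities, void probability) for one pixel.

    class_to_universal[k] lists the universal class indices which map to
    dataset class k (hypothetical layout of the one-to-many mapping).
    """
    p = softmax(universal_logits)
    probs = [sum(p[u] for u in idxs) for idxs in class_to_universal]
    covered = {u for idxs in class_to_universal for u in idxs}
    # Void collects all universal classes that the dataset does not use.
    void = sum(p[u] for u in range(len(p)) if u not in covered)
    return probs, void


def predict(universal_logits, class_to_universal, with_void=False):
    """Argmax over dataset classes, optionally with void as the last index."""
    probs, void = dataset_probs(universal_logits, class_to_universal)
    if with_void:
        probs = probs + [void]
    return max(range(len(probs)), key=probs.__getitem__)


# Toy usage: 4 universal classes, 2 dataset classes, universal class 3 unused.
logits = [2.0, 1.0, 0.0, 3.0]
mapping = [[0, 1], [2]]
print(predict(logits, mapping), predict(logits, mapping, with_void=True))
```

With `with_void=True` the void class competes in the argmax, which corresponds to the WildDash 2 setting described above.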
We train the model with a compound loss which modulates the negative log-likelihood in order to prioritize poorly classified pixels and pixels at boundaries [orsic20pr]. We avoid caching logits at full resolution by implementing the log-sum-prob loss as a layer with a custom backprop step. This decreases the memory footprint during training by 2.54 GB per GPU.
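The unmodulated core of such a loss can be sketched for a single pixel as the negative logarithm of the summed probability of all universal classes which map to the ground-truth dataset class. The sketch below assumes this reading of the log-sum-prob term. It omits the modulation for hard pixels and boundaries as well as the custom backprop step, and the function names are hypothetical.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def log_sum_prob_nll(universal_logits, target_universal):
    """Negative log of the total probability assigned to the target set.

    target_universal lists the universal class indices which map to the
    ground-truth dataset class of this pixel (hypothetical layout).
    Equivalent to -log(sum of softmax(logits)[u] for u in target_universal).
    """
    subset = [universal_logits[u] for u in target_universal]
    return logsumexp(universal_logits) - logsumexp(subset)


# Toy usage: the target dataset class maps to universal classes 0 and 1.
print(log_sum_prob_nll([2.0, 1.0, 0.0, 3.0], [0, 1]))
```

When the target set contains a single universal class, the expression reduces to the standard cross-entropy for that class.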
5 Training details
We train our submission on 6 Tesla V100 GPUs with 32 GB RAM. We use random horizontal flipping, scale jittering and square cropping according to the schedule from Table 2. We attempt to alleviate noisy ScanNet labels by setting the boundary modulation to 1 (minimum) for all ScanNet crops.
We optimize our model with Adam. The learning rate is attenuated by cosine annealing from to . We freeze all batch normalization layers at epoch 50 and train for 3 more epochs. Once we freeze the batchnorm, we reset the gradient moments used by Adam. The training involves 140k iterations, which took around 4 days on our hardware.
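Cosine annealing can be sketched as below. The endpoint values `lr_max` and `lr_min` in the usage example are placeholders chosen only for illustration, and the per-iteration update is likewise an assumption. Only the total of 140k iterations is taken from the text.

```python
import math


def cosine_lr(step, total_steps, lr_max, lr_min):
    """Learning rate annealed from lr_max (step 0) to lr_min (last step)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Toy usage with placeholder endpoints (not the values used for the submission).
for step in (0, 70_000, 140_000):
    print(step, cosine_lr(step, 140_000, lr_max=1e-3, lr_min=1e-5))
```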
We sample crops from 93,369 training images from all datasets. We favour fair representation of classes and datasets by composing mini-batches with roulette wheel sampling. In particular, we encourage sampling of images with multiple class instances and images with rare classes.
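Roulette wheel sampling draws each image with probability proportional to its weight. The sketch below assumes one plausible weighting, the sum of inverse class frequencies over the classes present in an image, which favours images with several classes and images with rare classes. The actual weights of the submission are not specified in the text, and all names are hypothetical.

```python
import bisect
import itertools
import random
from collections import Counter


def image_weights(image_classes):
    """Weight each image by the summed inverse frequency of its classes.

    image_classes[i] lists the classes present in image i. The weighting
    scheme is an assumption made for illustration.
    """
    freq = Counter(c for classes in image_classes for c in set(classes))
    return [sum(1.0 / freq[c] for c in set(classes)) for classes in image_classes]


def roulette_sample(weights, k, rng):
    """Draw k indices with replacement, proportionally to the weights."""
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    picks = []
    for _ in range(k):
        r = rng.random() * total  # position of the ball on the wheel
        # Clamp to guard against r rounding up to the total.
        picks.append(min(bisect.bisect_right(cumulative, r), len(weights) - 1))
    return picks


# Toy usage: the image with only the common class is sampled least often.
images = [["road", "car"], ["road"], ["road", "train"]]
w = image_weights(images)
print(w, Counter(roulette_sample(w, 1000, random.Random(0))))
```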
The biggest challenges were the sheer extent of training data and the details of multi-GPU implementation. We do not use batchnorm synchronization for simplicity and speed of training. Accordingly, the population statistics are updated on only one GPU. Conversely, we perform model updates by accumulating gradients from all GPUs. We did not find enough time to implement gradient checkpointing in the multithreaded environment. Hence, we based our solution on a backbone which provides reasonable performance without checkpointing. We did not use photometric jittering since we were not sure that our model has enough capacity for such a recognition problem. We had to avoid training on unlabeled images with an unsupervised loss in order to meet the deadline.
| Epochs | Crop size | Batch size | Jitter range | Speed |
|---|---|---|---|---|
| 0 – 15 | 384 | 6×16 | 0.75 – 1.33 | 45 fps |
| 16 – 31 | 512 | 6×8 | 0.60 – 1.67 | 27 fps |
| 32 – 49 | 768 | 6×4 | 0.50 – 2.00 | 14 fps |
| 50 – 52 | 1024 | 6×2 | 0.40 – 2.50 | 9 fps |
We evaluate all test images on only three scales due to limited time. Pixels where the argmax corresponds to the void class occur on books (ScanNet), railroad (KITTI), cobblestone (Cityscapes), etc. We find most such pixels in ScanNet (9.97%), Cityscapes (7.84%), and WildDash (7.01%), and the fewest in Vistas (0.04%). Table 3 summarizes the mIoU performance of the only two complete RVC 2020 submissions. The poor Cityscapes performance of our method is likely caused by mapping all WildDash trains to only one of the two universal classes corresponding to on-rail vehicles. We corrected that in the 32nd epoch of training. Otherwise, our submission prevails on all datasets except ADE20k, which has the smallest images. This suggests an advantage of pyramidal fusion on large input resolutions.
We have described our submission to the semantic segmentation contest of the Robust Vision Challenge 2020. Our model outputs dense predictions into 193 universal classes, which requires an enormous quantity of GPU memory during training. This suggests that memory consumption represents a dominant obstacle towards accurate multi-domain dense prediction. The reported results indicate that pyramidal fusion is capable of producing competitive performance in large-scale setups.